Fitting Data Points With Weighted Least Squares

About 6 years ago I wrote about how to fit curves, surfaces, volumes, and hypervolumes to data points in an “online algorithm” version of least squares – that is, you could update the fit as new data points come in.

In this post, I’m going to show how to adjust the least squares equations to be able to give weights for how important data points are in the fit. We’ll be looking at curves only, and not focusing on the “online algorithm” aspect, but this will work for both surfaces/volumes as well as the online updating.

The simple standalone C++ code that goes with this post can be found at

If you google weighted least squares, or look for it on youtube, almost every example you are likely to find talks about it being used when the variance of data points are different in different parts of the graph, and that is used to make points contribute more or less to the final fit.

That is a pretty neat usage case, but if you are just trying to fit a curve to data points as a way to compress data or optimize calculations, you probably are not really concerned with the variance of the data points you are trying to fit. Luckily, weighting data points in least squares is pretty simple.

Here is the equation you solve in ordinary least squares (OLS):

A^TAx = A^Ty

where A is a matrix that contains all the powers of x of each data point, y is a vector that contains the y values of each data point, and x is polynomial coefficients that we are trying to solve for.

A pretty straightforward way to solve this equation for x is to make an augmented matrix where the left side is A^TA, and the right side is A^Ty.

\left[\begin{array}{r|r} A^TA & A^Ty \end{array}\right]

From there you can do Gauss-Jordan elimination to solve for x ( At the end, the left side matrix will be the identity matrix, and the right side vector will be the value for x.

\left[\begin{array}{r|r} I & x \end{array}\right]

Adjusting this formula for weighting involves introducing a weight matrix called W. W has a zero in every element, except along the main diagonal where the weights of each data point are. W_{ii} = \text{weight}_i.

There is something called “generalized least squares” (GLS) where the weight matrix is a covariance matrix, allowing more complex correlations to be expressed with non zeroes off of the main diagonal, but for weighted least squares, the main diagonal gets the weights of the data points.

The equation to solve for weighted least squares (WLS) becomes this:

A^TWAx = A^TWy

You can make an augmented matrix like before:

\left[\begin{array}{r|r} A^TWA & A^TWy \end{array}\right]

And use Gauss-Jordan elimination again to solve for x:

\left[\begin{array}{r|r} I & x \end{array}\right]


Here are a few runs of the program which again is at

Starting with 3 data points (0,0) (1,10), (2,2) all with the same weighting, we fit them with a line and get y = x+3. It’s interesting to see that data points 0 and 2 each have an error of +3 while data point 1 has an error of -6.

If we set the weighting of the 2nd point to 0.0001, we can see that it absorbs most of the error, because the weighting says that it is 10,000 times less important than the other data points.

Alternately, we can give it a weight of 10.0 to make it ten times as important as the other two data points. The error becomes a lot smaller for that data point due to it being more important.

If we switch to a quadratic curve, all three points can be well fit, with the same weights.

The example code works in any degree and also shows how to convert from these power basis polynomials to Bernstein basis polynomials, which are otherwise known as Bezier curve control points (more info on that here: The weighting you see in curves is not the same as the weighting here. When you convert the polynomial to a Bezier curve, it’s an unweighted curve. Weighted Bezier curves are made by dividing one Bezier curve by another and are called “Rational” curves, where unweighted curves are called “Integral” curves. Rational (weighted) curves are able to express more shapes than integral curves can, even when those integral curves are made by weighted least squares.


While we used weighted least squares to give different weightings to different data points, the main motivation you find on the internet seems to be about having different levels of confidence in the data points you get. This makes me wonder if we could use this in graphics, using 1/PDF(x) as the weighting in the least squares fitting like we do in importance sampled monte carlo integration. This seems especially useful if using this as an “online algorithm” that was improving a fit as it got more samples, either spatially or temporally.

Sampling Importance Resampling

The last post was on weighted reservoir sampling, one of the main techniques used in the 2020 paper “Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting” ( This post is on the other main technique used, which is sampling importance resampling or SIR.

SIR allows you to have a list of items from probability distribution A, and draw items from it from probability distribution B.

The C++ code that goes with this post is at

SIR Algorithm

The algorithm starts with a list of values drawn from probability distribution A, assigning each value a weight of 1.

Next, you divide the weight of each item by the PDF for that item in distribution A.

You then multiply the weight of each item by the PDF for that item in distribution B, the target distribution.

Then you normalize the weights of all items in the list so they sum to 1.0.

When you use that normalized weight as a probability for choosing each item, you are drawing values from probability distribution B, even though probability distribution A was used to generate the list.

Example and Intuition

An example really helps understand this I think.

We’ll start with a small list:

4, 5, 5

Choosing an item from the list is twice as likely to give you a 5 than a 4 because there are twice as many of them. When we do uniform sampling on that list, it’s like we are doing weighted sampling of the list, where all the weights are 1. let’s put those weights in the list.

4 (1), 5 (1), 5 (1)

The probability for choosing each value in that list is this:

4: 1/3

5: 2/3

You divide the weights by those probabilities to get:

4 (3), 5 (3/2), 5 (3/2)


4 (3), 5 (1.5), 5 (1.5)

Which we can normalize to this:

4 (0.5), 5 (0.25), 5 (0.25)

If you do a weighted random selection from this list using those weights, you can see that you are now just as likely to get a 4, as you are a 5. So, dividing the weights by the probabilities gave us a uniform distribution.

Now, let’s say our target distribution is that choosing a 4 should be three times as likely as choosing a 5. Here are the probabilities for that:

4: 0.75

5: 0.25

We now multiply the weights by these probabilities and get this:

4 (0.375), 5 (0.0625), 5 (0.0625)

Normalizing those weights, we get this:

4 (0.75), 5 (0.125), 5 (0.125)

If you sample from this list using those weights, there is a 75% chance you’ll get a 4, and a 25% chance you’ll get a 5.

So, even though we made this list by sampling such that a 5 was twice as likely as a 4, we can now sample from it so that a 4 is three times more likely than a 5.

The intuition here is that by dividing the weights by the probabilities that generated the list, we made the list uniform. When we multiplied the weights by the probabilities of the new distribution, it made the list take on that distribution. Not too complicated right?


Here we use this method to generate 100,000 uniform random float values between -1 and 1, and use this technique to sample 10000 sigma 0.1 Gaussian random numbers from it. This is the starting histogram, and the sampling importance resampled values histogram.

Here we start with a sigma 1.0 Gaussian distributed list of values, and resample it into a y=3x^2 PDF.

It doesn’t always work out though. This method can only output values that are in the starting list, so if the starting list doesn’t cover the full range of the target PDF, you won’t get values in the full range of the target PDF. Here we generate uniform random values between -0.1 and 0.3, and then try to resample that into a sigma 0.1 Gaussian. We can only output numbers that exist between -0.1 and 0.3, which isn’t the full range of the Gaussian.

This technique can also hit problems if there aren’t enough unique values in the source list. Here we generate only 500 uniform random samples, instead of the 100,000 that we did for the previous tests, and try to turn it into a Gaussian.

Here we do the non failure uniform to gaussian test in reverse with the same number of start and end samples. It seems as though the rare events at the tail ends of the Gaussian really explode…

Other Stuff

When doing the SIR algorithm, we normalized the weights to be able to a random selection from the list. The normalized weight of an item was it’s probability of being chosen.

This means that you have to do multiple passes through the list: Once to get the sum of the weights, then again to normalize them. Now you can select items from the list randomly.

The last blog post showed us how to do a fair random selection of a weighted list in one pass though, where the weights didn’t need to be normalized. So yep, if you use reservoir sampling to sample from the weighted list, you don’t need to normalize the weights, and you can do the random weighted selection without any preprocessing. You can even calculate the weight for the item the very first time you see it (like as the result of a ray trace?), as you (fairly) consider it for weighted random selection. How cool is that?

Further, the list of samples you choose to take before resampling the importance sampling might be uniform, but it doesn’t need to be. Maybe you could do something like MIS, where you select items using one distribution, then of that distribution, you find the best item using a second distribution? Not sure if that makes sense or not.

Lastly, while talking about uniform and non uniform distributions, white noise is implied, but it need not be white noise. Gaussian blue noise is a thing, or uniform red noise. There should also be ways to use different noise colors on the input and the output, or do weighting (scoring) to promote low discrepancy or other scoring functions that have some sort of memory or awareness of other samples.

That got off onto a tangent a bit, but the point is, there is a lot of possibility here.

If you understood this post and the last, and want more, give the paper a read. You’ll be able to understand what it’s talking about, and it goes a bit deeper into some related topics and implementation details.

“Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting”

Thanks for reading!


Stack Exchange:

Picking Fairly From a List of Unknown Size With Reservoir Sampling

There was an awesome graphics paper in 2020 that put together some fun and interesting statistics ideas to try and solve the “many lights” problem of ray tracing. The many lights problem is that when you are trying to shade a pixel on a surface, you technically need to shoot a ray at every light to see if it is shining onto the surface or not. This is relatively ok for a handful of lights, but becomes a big problem when you have 100,000 lights. The method in the paper has each pixel try a couple samples, share what was found with neighbors, and share with itself next frame to have some memory between frames so that it can improve where it samples over time.

The paper is a good read and you should check it out: “Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting”

In this post, we are going to be talking about one of the methods the paper uses, which is reservoir sampling. This allows you to stream a list of data and fairly (uniform randomly) choose an item from the list as you go. This also works when you want to choose items in the list with different probabilities. You can also just choose a random subset of the list to look at, do the same process, and the results aren’t bad.

This is useful in the many lights problem because it allows a random subset of lights to be considered by each pixel, each frame, and over time they find the best lights to sample with limited ray counts.

There are lots of other uses for reservoir sampling though, within rendering and game development, and outside of it too.

The 237 line single file C++ program that made the data for this post can be found at:

Uniform Reservoir Sampling

Choosing an item uniform randomly from a streaming list of unknown size is surprisingly easy. Starting at index 1, the chance of selecting an item is 1/index. This means that index 1 is chosen when seen, then there is a 50% chance (1/2) to select index 2 instead. After that, there is a 33% chance (1/3) to select index 3 instead, and a 25% chance (1/4) to select index 4, and so on.

The mathematics of why this works is illustrated well in this youtube video:

int index = 0;
int chosenValue = 0;

  int nextValue = GetNextValue();
  float chance = 1.0f / float(index);
  if (GetRandomFloat01() < chance)
    chosenValue = nextValue;

Weighted Reservoir Sampling

Adding non uniform sampling to the algorithm is not much more difficult. The chance to choose a specific item becomes the weight of that item, divided by the sum of the weights of the items seen so far. The weights don’t have to be a normalized PDF or PMF, you can use weights of any scale. If you use a weight of “1” for all the items, it becomes the same as uniform reservoir sampling.

float weightSum = 0.0f;
int chosenValue = 0;

  item nextValue = GetNextValue();
  weightSum += nextValue.weight;
  float chance = nextValue.weight / weightSum;
  if (GetRandomFloat01() < chance)
    chosenValue = nextValue.value;
PDF(x) =3x^2

Subset Uniform Reservoir Sampling

If you are trying to uniform randomly choose an item from a list that you know the size of, but is just too big, you can do uniform reservoir sampling on a random subset of the list. The larger the random subset, the better the results, but smaller subsets can work too – especially if you are amortizing a search of a larger subset over time.

int chosenValue = 0;

for (int index = 1; index <= sampleCount; ++index)
  int nextValue = GetRandomListValue();
  float chance = 1.0f / float(index);
  if (GetRandomFloat01() < chance)
    chosenValue = nextValue;
10% of the list was sampled in each test

Subset Weighted Reservoir Sampling

Adding weighting to the subset reservoir sampling is pretty easy too.

float weightSum = 0.0f;
int chosenValue = 0;

for (int index = 1; index <= sampleCount; ++index)
  item nextValue = GetRandomListValue();
  weightSum += nextValue.weight;
  float chance = nextValue.weight / weightSum;
  if (GetRandomFloat01() < chance)
    chosenValue = nextValue;
10% of the list was sampled in each test. PDF(x) =3x^2

Combining Reservoirs

If you have multiple reservoirs, they can be combined. The paper mentioned uses this to combine last frame reservoir samples with current frame reservoir samples.

void CombineReservoirs(int chosenValueA, float weightSumA, int chosenValueB, float weightSumB, int& newChosenValue, float& newWeightSum)
  newWeightSum = weightSumA + weightSumB;
  float chanceA = weightSumA  / newWeightSum;
  if (GetRandomFloat01() < chanceA)
    newChosenValue = chosenValueA;
    newChosenValue = chosenValueB;

For uniform sampling in the above, weightSum is the number of items seen. It’s as if every item had a weight of 1.

Choosing Multiple Items

If you want to choose N items instead of just 1, you could run the single item selection code N times in parallel to choose those N items (Note: you could have the same item chosen multiple times). If you do that, you’ll notice that the weight sum / count will be the same for each of them, so you don’t actually need to store that per each and it can be shared. That will give you something like this for uniform reservoir sampling.

int index = 0;
int chosenValues[N];

  int nextValue = GetNextValue();
  float chance = 1.0f / float(index);
  for (int i = 0; i < N; ++i)
    if (GetRandomFloat01() < chance)
      chosenValue[i] = nextValue;

Image Sharpening Convolution Kernels

Many people know that when you blur an image, you are applying a low pass filter that removes high frequencies.

From this, it’d be reasonable to expect that applying a high pass filter would sharpen an image, right?

Well, it turns out that is not the case! High pass filtering an image gives you everything that a low pass filter would remove from the picture, and it gives you ONLY that. Because of this, high pass filtering an image looks quite a bit different than you’d expect.

Our original test image, from NIST
High pass filtering by subtracting a low pass filter result (gaussian blur) from the original image.

So when you sharpen an image in something like photoshop, what is it doing?

It does do a high pass filter and then adds those high-frequency details to the original image, thus boosting the high-frequency content. It’s doing an “Unsharp mask” You may need to open the original image and the one below in separate browser tabs to flip back and forth and see the difference.

The high pass filtered result added to the original image.

The algorithm for sharpening an image is then:

  1. Blur an image using whatever blur you wish (e.g., Box, Tent, Gaussian)
  2. Subtract the blurred result from the original image to get high-frequency details.
  3. Add the high-frequency details to the original image.

Algebraically, this can be expressed as:

image + (image – blurred)


2 * image – blurred

Blurring is most commonly done by convolving an image with a low frequency kernel that sums to 1. If we are assuming that path to blurring, we can actually build a sharpening kernel which encodes the equation we just derived. For “image”, we’ll just use the identity matrix for convolution which is all zeros except a 1 in the center. That gives us this:

2 * identity – blur

If we wanted to make a 3×3 box blur into a sharpening filter we would calculate it this way:

2 * \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} - \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} * \frac{1}{9} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 17 & -1 \\ -1 & -1 & -1 \end{bmatrix} * \frac{1}{9}

That makes this result:

You could also get a Gaussian blur kernel, like this one of Sigma 0.3 (calculated from, it’s already normalized so adds up to 1.0) and calculate a sharpening filter from that:

2 * \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} - \begin{bmatrix} 0.0023 & 0.0432 & 0.0023 \\ 0.0432 & 0.8180 & 0.0432 \\ 0.0023 & 0.0432 & 0.0023 \end{bmatrix} = \begin{bmatrix} -0.0023 & -0.0432 & -0.0023 \\ -0.0432 & 1.182 & -0.0432 \\ -0.0023 & -0.0432 & -0.0023 \end{bmatrix}

That makes this result:

If you are wondering why a blur kernel has to add to 1, it technically doesn’t, but whatever it adds to is the brightness multiplier of the image it is being applied to. You can even use this to your advantage if you want to adjust the brightness while doing a blur. For instance, this kernel is a 3×3 box blur which also doubles the image brightness because it adds to 2.0.

\begin{bmatrix} 2/9 & 2/9 & 2/9 \\ 2/9 & 2/9 & 2/9 \\ 2/9 & 2/9 & 2/9 \end{bmatrix}

When using the formulation of 2 * identity – blur to calculate our sharpening filter, if blur sums to 1, and of course identity sums to 1, our equation becomes 2 * 1 – 1 = 1, so our sharpening filter also sums to 1, which means it doesn’t make the image brighter or darker. You could of course multiply the sharpening filter by a constant to make it brighten or darken the image at the same time, just like with a blur.

You might have noticed that the blur kernels only had values between 0 and 1 which meant that if we used it to filter values between 0 and 1 that our results would also be between 0 and 1 (so long as the blur kernel summed to 1, and we weren’t adjusting brightness).

In contrast, our sharpening filters had values that were negative, AND had values that were greater than 1. This is a problem because now if we filter values between 0 and 1, our results could be less than 0, or greater than 1. We need to deal with that by clamping and/or remapping the range of output to valid values (tone mapping). In the examples shown here, I just clamped the results.

This problem can come up in low pass filters too (like Lanczos which is based on sinc), but doesn’t with box or gaussian.

You might be like me and think it’s weird that a low pass filter (blur kernel) sums to 1, while a high pass filter sums to 0. Some good intuition I got from this on twitter (thanks Bart!) is that the sum of the filter is the filter response for a constant signal. For a low pass filter, you’d pass the constant (0 hz aka DC) signal through, while for a high pass filter, you’d filter it out.

The 170 lines of C++ code that made the images in this post can be found at:

The Discrete Cosine Transform, and Derivatives

First off, fuck Putin. I wish the world was giving more direct support to Ukraine against the invasion (It seems like today, that is starting to happen though luckily!). There’s too much tolerance happening for bad behavior IMO. Violence is terrible, and that’s why it has to be stopped as quickly and decisively as possible. You stop the trouble makers, you don’t just hope they’ll stop making trouble. Ukraine is fighting back hard and IMO we are all hoping they are successful, but they shouldn’t have to do this alone.

Onto the math!

For some reason, the discrete cosine transform (DCT) has been confusing to me for a long time, even though I have been intimately familiar with the discrete Fourier transform (DFT). I expected that there would be more to it than there was.

This post is a follow up to my last post, which talks about how to adjust the position of points in a point set to change the frequencies present in the point set. The last post is actually significantly more complex than this one!

There are two web demos that go with this post, that work like the demos from last post, but using the DCT instead of the DFT:

1D Signal DCT

Let’s start with a quick overview of the DFT. Here’s the formula for calculating the DFT of 1D signals.

X_k = \sum_{n=0}^{N-1} x_n * e^{2 \pi i k n/N}

N is the length of the signal, k is the frequency being evaluated (0 for DC, 1 for 1hz, 2 for 2hz, etc), n is the index of the current value in the signal, x_n is the value at that index, and X_k is the complex valued coefficient representing the phase and magnitude of frequency k in the signal.

You can also express the equation like this, which explicitly breaks the sum into a sum of imaginary and real parts:

X_k = \sum_{n=0}^{N-1} x_n * e^{2 \pi i k n/N} = \sum_{n=0}^{N-1} x_n * \cos(2 \pi k n/N) + x_n * i \sin(2 \pi k n/N)

From here, if you wanted to get the magnitude of the frequency in the signal, you’d treat the real and imaginary parts of the coefficient as x and y values of a vector and get the length of the vector. If you wanted to get the phase of the frequency (how much it is offset in the signal), you use atan2(imaginary, real).

Things get a lot simpler for the DCT: we only look at the real / cosine term:

X_k =\sum_{n=0}^{N-1} x_n * \cos(2 \pi k n/N)

All the symbols are the same except for X_k which is now a scalar value which is the frequency magnitude. You don’t get phase information like you do with the Fourier transform, but no more complex math (ha!) to calculate the magnitude.

That fact makes it a lot easier to get the derivative – or how changing a specific value index in the signal affects a specific frequency. If we want to know how changing the value at index m affects frequency magnitude X_k, all terms of the sum go to zero as constants except for the one involving index m, which makes the derivative this:

\frac{dX_k}{dx_m} = \cos(2 \pi k m/N)

You can gather up this value for frequency k for each of the N values and get a gradient which will tell you how to adjust all values in the signal to increase or decrease frequency k.

You can also gather up this value for index m, for each frequency k, to get a gradient that tells you how changing this signal value affects all frequencies.

1D Point Set DCT

We can change the DCT formula to be for sparse values in 1D, instead of a dense N valued signal.

X_k =\sum_{p \in P} \cos(2 \pi k p)

And once again, it’s super easy to get the derivative of, to know how much moving a specific point q affects the frequency.

\frac{dX_k}{dq} =-2 \pi k \sin(2 \pi k q)

That’s it, we are done!

You can see this in action here:

2D Signal DCT

To calculate a DCT of am MxN image, we have frequency j across the x axis, multiplied by frequency k across the y axis.

X_{jk} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x_{mn} * \cos(2 \pi j m/M) * \cos(2 \pi k n/N)

If we want to take the derivative of how this frequency pair changes as the sigal value at a specific location pq changes, it’s again pretty easy. All values in the sum are constants except for the one involving the value at pq. The cosine terms are also constants in this context.

\frac{dX_{jk}}{dx_{pq}} = \cos(2 \pi j p/M) * \cos(2 \pi k q/N)

2D Point Set DCT

We can change the 2D dense signal DCT into a 2D point set DCT that looks like this:

X_{jk} = \sum_{p \in P}  \cos(2 \pi j p_x) * \cos(2 \pi k p_y)

If we want to get the derivative of frequency X_{jk} as we move a specific 2D point q around, we need to take a partial derivative on the x axis, and a partial derivative on the y axis. This tells us how much the frequency magnitude changes as we move the point on the x and y axis.

\frac{X_{jk}}{dq_x} = -2 \pi j \sin(2 \pi j q_x) * \cos(2 \pi k q_y)

\frac{X_{jk}}{dq_y} = \cos(2 \pi j q_x) * -2 \pi k \sin(2 \pi k q_y)

You can see this in action here:

Differences Between DCT and DFT

There are some differences between using the DCT and DFT for frequency analysis and similar.

For one, the DFT assumes that the data you give it infinitely repeats. The DCT however assumes that the data you give it repeats forever too, but that each time it repeats, it is flipped like in a mirror.

Another difference is that because DFT has phase and DCT doesn’t, translation of data affects DCT frequency magnitude results, while it doesn’t affect DFT frequency magnitude results, but it does affect DFT phase results.

More concretely, imagine you have a 2hz cosine wave starting at x=0 (so, has a phase of 0 degrees). Both DFT and DCT will recognize this as a 2hz frequency with amplitude 1.

If you move this wave to the right so that it starts at x = pi/2, the DFT will still show a 2hz frequency with amplitude 1, but will now show a pi/2 phase. the DCT however, will show a 2hz frequency with amplitude -1!

If you play around in the demos that go with this post you can see this in action, that translation matters for DCT, but not for DFT, when looking at frequency magnitudes.

Lastly, DCT frequency magnitudes can be negative, where in DFT they can’t be negative. The demos are adjusted to account for this.

Once again, thanks for reading, and I hope you find this useful or at least interesting.

And here’s hoping Ukraine comes out on top soon, with friends coming to their aid and helping them rebuild. What a terrible and senseless situation.

Adjusting Point Sets in Frequency Space Using a Differentiable Fourier Transform

A couple years ago, Bart Wronski wrote an interesting blog post on how the Fourier transform was easily differentiable, which allowed using gradient descent to make blue noise textures. His post is here:

In this post I’ll show the details of how to do the same for points sets instead of textures.

There are two web demos that go along with this post. They let you make point sets in 1D and 2D, and move them around in either normal space, or frequency space.
1D Point Set Frequencies:
2D Point Set Frequencies:

NOTE: We are going to adjust frequency AMPLITUDES of point sets, but not frequency PHASES.

1D Fourier Transform of Point Sets

In a previous blog post I wrote about the discrete Fourier transform of 1D data ( which shows the formula:

X_k = \sum_{n=0}^{N-1} x_n * e^{2 \pi i k n/N}

As a quick recap, k is the frequency (1 for 1hz, 2 for 2hz, etc), X_k is the complex frequency information for that frequency, n is the index of the current value, x_n is the value at that index, and N is the total number of values.

If you have a 1D point set, one way to Fourier transform it is to make a 1D array of data, fill it with zeros, and then plot those points as ones into that 1D array. You then Fourier transform that array using the formula above.

It’s up to you what size to make that array. If you make the array smaller, the transform will be faster, but it will be less accurate in general. If the array is larger, the transform will be slower, but will be more accurate in general.

Here are some examples of different sizes arrays holding the points set (0.0, 0.25, 0.5, 0.75), where we assume the array holds the values [0,1):

  • 4: [1111]
  • 8: [10101010]
  • 9: [101001010]
  • 12: [100100100100]

See how the sized 9 array doesn’t have the points evenly spaced? If you have values that don’t evenly divide the array size, larger arrays make the plots more accurate.

Another way to transform these points though is to think of each of the points as a delta function that is zero everywhere, except at their value, where the function is one. We can then modify the Fourier transform to sum up the contribution of each point towards a specific frequency. Since we know the function of each point is zero everywhere except one location where it is one, we only have to evaluate it at one place. That gives us this:

X_k = \sum_{p \in P} e^{2 \pi i k p}

Where again, k is the frequency being analyzed (1 for 1hz, 2 for 2hz, etc), X_k is the complex frequency information for that frequency, P is the total set of points in the point set, and p is the value of the current point in the point set.

Adjusting 1D Point Sets in Frequency Space

To adjust frequency magnitudes of our point set, we are going to need to show the formula for calculating frequency magnitude and we are going to need to differentiate it to get the gradient of it, so we know which points to move in which direction to increase or decrease specific frequency magnitudes.

To calculate the magnitude of a specific frequency, we start with our equation from the last section.

X_k = \sum_{p \in P} e^{2 \pi i k p}

Remembering that e^{ix} = \cos(x) + i*\sin(x), we can turn that to this:

X_k = \sum_{p \in P} \cos(2 \pi k p) + i * \sin(2 \pi k p)

The magnitude of the frequency is the length of the vector (real, imaginary), so let’s break up the real and imaginary parts:

\Re(X_k) = \sum_{p \in P} \cos(2 \pi k p)

\Im(X_k) = \sum_{p \in P} \sin(2 \pi k p)

The magnitude would then be:

\lVert X_k \rVert= \sqrt{(\sum_{p \in P} \cos(2 \pi k p))^2 + (\sum_{p \in P} \sin(2 \pi k p))^2}

We’ll remove the square root to make it easier to differentiate, and get this:

\lVert X_k \rVert^2= (\sum_{p \in P} \cos(2 \pi k p))^2 + (\sum_{p \in P} \sin(2 \pi k p))^2

Now we need to get the derivative of that function for each point p, which together is called the gradient. This tells us how far to move each point to make the frequency magnitude increase by 1. It also tells us the direction since a negative number would mean to move the point to the left and a positive number would mean to move the point to the right.

We can use the sum rule to break our function into 2 simpler derivatives and deal with them separately:

(f(x)+g(x))' = f'(x) + g'(x)

(\lVert X_k \rVert^2)'= ((\sum_{p \in P} \cos(2 \pi k p))^2)' + ((\sum_{p \in P} \sin(2 \pi k p))^2)'

Let’s start by differentiating the real (cosine) term:

f' = ((\sum_{p \in P} \cos(2 \pi k p))^2)'

We can use the product rule to deal with the squaring:

(f(x)^2)' = (f(x)*f(x))' = f'(x)f(x) + f(x)f'(x) = 2f(x)f'(x)

f' = 2*(\sum_{p \in P} \cos(2 \pi k p))*(\sum_{p \in P} \cos(2 \pi k p))'

That is all differentiated except for the last term:

g' = (\sum_{p \in P} \cos(2 \pi k p))'

If we are differentiating this function g for a specific p (we’ll call it p_i), all of the terms are constants and disappear, except for the p_i term. This is easily differentiable because \cos(ax)' = -a \sin(ax)

g' = (\cos(2 \pi k p_i))'

g' = -2 \pi k \sin(2 \pi k p_i)

We can plug our g result back into f and get this as the derivative of our cosine (real) term:

f' = 2*(\sum_{p \in P} \cos(2 \pi k p))*-2 \pi k \sin(2 \pi k p_i)

f' = -4 \pi k*(\sum_{p \in P} \cos(2 \pi k p))*\sin(2 \pi k p_i)

If we do the same steps for the sine (imaginary) term we end up with this:

h' = 4 \pi k*(\sum_{p \in P} \sin(2 \pi k p))*\cos(2 \pi k p_i)

We can then add f’ and h’ together to get the full equation we were looking for!

(\lVert X_k \rVert^2)' = -4 \pi k*(\sum_{p \in P} \cos(2 \pi k p))*\sin(2 \pi k p_i) + 4 \pi k*(\sum_{p \in P} \sin(2 \pi k p))*\cos(2 \pi k p_i)

Here is that monster as javascript code from the demo:

If you calculate this value for each point and put it into an array, you’ll have the gradient of the frequency k’s magnitude (squared). If you want to increase the frequency k’s magnitude, you add the gradient to the points. If you want to decrease the frequency k’s magnitude, you subtract the gradient from the points.

The gradient is only guaranteed to be accurate for an infinitesimally small step, so what I do in the web app is normalize this gradient vector, and multiply it by a step size before I move the points. I also multiply the step size by how many points there are so that the distance that each point moves doesn’t decrease as more points are added (otherwise they would have to share a unit length movement among N points, which gets shorter as N gets bigger). I also have a step count which allows this operation to be done M times, re-evaluating the gradient each time. This results in an interactive user controlled gradient descent.

You can play with the demo here to see it all in action:

1D Point Set Frequencies:

For more information about why you might want to normalize a gradient when doing gradient descent, give this a read:

2D Fourier Transform of Point Sets

The 2D Fourier transform is just the 1D Fourier transform done on each axis. Because of this, you have a horizontal frequency j and a vertical frequency k, for an image that is MxN pixels:

X_{j,k} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x_{m,n} * e^{2 \pi i j m/M}* e^{2 \pi i k n/N}

X_{j,k} is the complex frequency information for the frequency, j is the horizontal frequency, k is the vertical frequency, x_{m,n} is the pixel value at location (m, n).

We can change this to be a Fourier transform of points p again like this:

X_{j,k} = \sum_{p \in P} e^{2 \pi i j p_x}* e^{2 \pi i k p_y}

Where p_x is the x component of the point and p_y is the y component of the point.

We can use the identity e^{a} * e^{b} = e^{a+b} to simplify the equation a bit:

X_{j,k} = \sum_{p \in P} e^{2 \pi i j p_x + 2 \pi i k p_y}

X_{j,k} = \sum_{p \in P} e^{2 \pi i (j p_x + k p_y)}

… and we are done 🙂

Adjusting 2D Point Sets in Frequency Space

Let’s jump to breaking the function into real and imaginary parts.

\Re(X_{j,k}) = \sum_{p \in P} \cos(2 \pi (j p_x + k p_y))

\Im(X_{j,k}) = \sum_{p \in P} \sin(2 \pi (j p_x + k p_y))

The magnitude would then be:

\lVert X_k \rVert= \sqrt{(\sum_{p \in P} \cos(2 \pi (j p_x + k p_y)))^2 + (\sum_{p \in P} \sin(2 \pi (j p_x + k p_y)))^2}

We can square both sides again to make it easier to differentiate:

\lVert X_k \rVert ^2= (\sum_{p \in P} \cos(2 \pi (j p_x + k p_y)))^2 + (\sum_{p \in P} \sin(2 \pi (j p_x + k p_y)))^2

Remembering that we can differentiate each term independently, let’s start with the cosine (real) part again and start differentiation using the product rule:

f' = 2 * (\sum_{p \in P} \cos(2 \pi (j p_x + k p_y))) * (\sum_{p \in P} \cos(2 \pi (j p_x + k p_y)))'

We now have two variables to get derivatives for though, the x and y components of a specific point. We’ll call them p_{ix} and p_{iy}. We’ll start with p_{ix}. Remember that all terms of the sum except the ones with p_i in them are constants and will become zero.

\frac{df}{dp_{ix}} = 2 * (\sum_{p \in P} \cos(2 \pi (j p_x + k p_y))) * -2 \pi j  \sin(2 \pi (j p_{ix} + k p_{iy}))

\frac{df}{dp_{ix}} = -4 \pi j * (\sum_{p \in P} \cos(2 \pi (j p_x + k p_y))) * \sin(2 \pi (j p_{ix} + k p_{iy}))

We can do the say with the y component and get:

\frac{df}{dp_{iy}} = -4 \pi k * (\sum_{p \in P} \cos(2 \pi (j p_x + k p_y))) * \sin(2 \pi (j p_{ix} + k p_{iy}))

We can do the same process with the sine (imaginary) part and get these two equations:

\frac{dh}{dp_{ix}} = 4 \pi j * (\sum_{p \in P} \sin(2 \pi (j p_x + k p_y))) * \cos(2 \pi (j p_{ix} + k p_{iy}))

\frac{dh}{dp_{iy}} = 4 \pi k * (\sum_{p \in P} \sin(2 \pi (j p_x + k p_y))) * \cos(2 \pi (j p_{ix} + k p_{iy}))

We then add the real and imaginary x functions together to get a value for x, and we add the real and imaginary y functions together to get a value for y.

\lVert X_k \rVert ^2 \frac{d}{dp_{ix}} = -4 \pi j * (\sum_{p \in P} \cos(2 \pi (j p_x + k p_y))) * \sin(2 \pi (j p_{ix} + k p_{iy})) + 4 \pi j * (\sum_{p \in P} \sin(2 \pi (j p_x + k p_y))) * \cos(2 \pi (j p_{ix} + k p_{iy}))

\lVert X_k \rVert ^2 \frac{d}{dp_{iy}} = -4 \pi k * (\sum_{p \in P} \cos(2 \pi (j p_x + k p_y))) * \sin(2 \pi (j p_{ix} + k p_{iy})) + 4 \pi k * (\sum_{p \in P} \sin(2 \pi (j p_x + k p_y))) * \cos(2 \pi (j p_{ix} + k p_{iy}))

That sure is a mouthful, but it ends up looking a lot nicer in code. Javascript version from the demo below:

So the gradient of the function now has a 2D vector for every entry instead of a scalar. I’m not sure if this is still called a gradient. To normalize it, I sum up the length of each of the 2D vectors and then divide them all by that value. I then multiply by the number of points for the same reason as before, and I also have a step size multiplier, and a step count, just as in the 1D case.

You can play with the demo here:

2D Point Set Frequencies:


If you are up for a challenge, I have one for you!

Blue noise points are “randomized but roughly evenly spaced” and yet, I once stumbled on how to make a point set which shows a blue noise spectrum but has clumps. It’s a preset in the 1D demo and it looks like this:

Red noise on the other hand is defined as randomized but clumping points.

The challenge is this: Can you find points which are reasonably spread out on the numberline, or otherwise not clumping, but show a red noise spectrum?

I’d be real interested in seeing it if so. You can leave a comment here or find me on twitter at

Thanks for reading and hope you found this useful and/or interesting!

Two Low Discrepancy Grids: Plus Shaped Sampling LDG, and R2 LDG

I recently wrote a blog post that shows how interleaved gradient noise is a low discrepancy grid optimized for the 3×3 neighborhood sampling you find in temporal anti aliasing (TAA):

In some TAA implementations, instead of taking the full 3×3 neighborhood around a pixel, only the 4 cardinal direction neighbors will be sampled, making a plus shape (+) of sampling. This can reduce memory bandwidth requirements because it cuts the neighborhood sampling in half, from 8 samples down to 4.

In this post I’ll show a low discrepancy grid that is optimized for this sampling pattern. The formula for it is below, where pixelX and pixelY are integer pixel coordinates.

z = ((x+3y+0.5)/5) mod 1

or as code:

float PlusShapedLDG(int pixelX, int pixelY)
    return fmodf((float(pixelX)+3.0f*float(pixelY)+0.5f)/5.0f, 1.0f);

While we are talking about LDGs I also want to show another one based on a generalization of the golden ratio to 2D, made by Martin Roberts ( which is calculated like this:

z = (x / 1.32471795724474602596 + y / (1.32471795724474602596 * 1.32471795724474602596)) mod 1

or as code:

float R2LDG(int pixelX, int pixelY)
    static const float g = 1.32471795724474602596f;
    static const float a1 = 1/g;
    static const float a2 = 1/(g*g);
    return fmodf(float(pixelX)*a1+float(pixelY)*a2, 1.0f);

At the end of the post we’ll analyze these noise types along with some others:

Derivation of Plus Shaped Low Discrepancy Grid

It took a couple attempts at deriving this before I was successful.

We want a regular grid of values where each plus shape has every value 0/5, 1/5, 2/5, 3/5, 4/5. When I say every plus shape, I’m including overlapping ones. In TAA when a pixel looks at it’s plus shaped neighborhood, we want it to get an accurate as possible representation of the total possibilities for that pixel in that region of the screen. The pixels it finds should very accurately represent the actual histogram of what is possible in this area of pixels. The more accurate we make this, the better the neighborhood sampling history rejection/preservation logic will work.

I started out by putting symbols in a plus shape like this, planning to solve for the actual values later:

I next needed to figure out how to fill in the corners of these pixels. I opted to do so like this, trying to make the repeated values be as far away from the original values as possible.

You can see that at the center of each edge is the center of a plus shaped pattern which has 4 of the 5 letters already, so we can complete the plus by adding the 5th letter.

To fill out the rest of this grid, you can notice that there is a pattern of how letters are duplicated in the above: Their copy is either two to the right and one down, or two down and one to the left. You can use this pattern to complete this 5×5 square.

After filling out this 5×5 square you can see that both rules are true: symbols are repeated both two cells down one cell to the left, and also two cells to the right and one cell down.

Interestingly, if you continue growing this square outwards, it just repeats this 5×5 tile over and over, so we are done figuring out how to tile our values, but we still don’t know where the values should be or how to make a formula that calculates them.

At first I tried plugging in 0.0 for A, 0.2 for B, 0.4 for C, 0.6 for D and 0.8 for E. That made a really messy looking grid that I was unsure how to replicate with a formula.

Thinking about it differently, I looked at the first row which goes in order B,D,E,A,C and I made the values be in that order. B got value 0.0, D got 0.2, etc. That left me with this:

To make things easier to see, here are those values multiplied by 5:

It’s a bit easier to see a pattern here, isn’t it?

Starting at the upper left as (0,0), we can see that going to the right, the value increases by 1. Since this tile repeats infinitely, it means that when we go past 4, we go back to zero. So for that the formula would be z = x % 5.

We can also notice that taking a step down on the y axis, we add 3, but once again wrap around if we get past four. Putting this into the previous equation, it becomes z = (x + 3y) % 5.

We want this divided by 5 to be the values 0/5, 1/5, … 4/5, so our final equation becomes z = Fract((x + 3y)/5). Or z = ((x + 3y)/5) mod 1. Whichever notation you prefer.

Now for some subtlety. If you take the average of 0/5, 1/5, 2/5, 3/5, 4/5 you get 0.4. To make this unbiased, we want the average to be 0.5, which we can do by adding 1/10th to every value. That means our equation becomes z = (((x + 3y)/5) + 1/10) mod 1 or z = ((x + 3y + 0.5)/5) mod 1.

Another way to solve this problem could be instead of having values 0/5, 1/5, 2/5, 3/5, 4/5, you could instead divide by 4 to get 0/4, 1/4, 2/4, 3/4, 4/4, which would average to 0.5 as well. You may very well want to do that situationally, depending on what you are using the numbers for. A reason NOT to do that though would be in situations where a value of 0 was the same as a value of 1, like if you were multiplying the value by 2*pi and using it as a rotation. In that situation, 0 degrees would occur twice as often as the other values and add bias that way, where if you were using it for a stochastic alpha test, it would not introduce bias to have both 0 and 1 values.

Before analyzing this noise, let’s talk about the R2 LDG.

R2 Low Discrepancy Grid

The R2 low discrepancy grid was made by Martin Roberts and is based on his R2 low discrepancy sequence which generalizes the golden ratio to 2D (

To make the R2 low discrepancy sequence, you divide the index by 1.32471795724474602596 and fract to get the x component of the LDS, and divide index by 1.32471795724474602596 * 1.32471795724474602596 and fract to get the y component of the LDS.

To make the R2 low discrepancy grid, you divide the integer x pixel coordinate by 1.32471795724474602596, divide the integer y pixel coordinate by 1.32471795724474602596*1.32471795724474602596, add them together and fract to get the final scalar value.

Interestingly, this works with any rank 1 lattice, so there is some exploration to be done here IMO, to find more low discrepancy grids and see what sort of properties they have.

In fact, you can express both the plus shaped LDG and this R2 LDG in this rank 1 lattice style LDG:

z = (x * A + y * B) mod 1

With the plus shaped LDG, A is 1/5 and B is 3/5.

With the R2 LDG, A is 1 / 1.32471795724474602596, and B is 1 / (1.32471795724474602596*1.32471795724474602596).


Here we’ll use various types of grid noises to turn greyscale images into black and white stippled images. We do this by testing each image pixel against the corresponding noise pixel. If the noise pixel is a lower value (darker) than the image pixel, we put a black pixel in the output, else we put a white pixel in the output.

White Noise

Blue Noise





Stochastic Transparency

Here we test noise values against the transparency value and if the noise is less, we write a magenta pixel, else we don’t. The percentage of pixels that survive the transparency test are shown, and ideally would match the transparency value for the best results.





One thing worth talking about is that the percentage of white noise pixels that survive the alpha test swings pretty wildly compared to what the actual transparency value is. This effectively makes the pixels more opaque or more transparent than they should be, which causes problems when filtering spatially and/or temporally. That is on top of how white noise clumps together and leaves holes, which make it harder to filter than more equally spaced data points.

Another thing worth pointing out is that the plus shaped noise is VERY wrong at 10% and 30% percent, but does very well at 20% and 40%. The reason for this is because of how the plus noise is discretized into 1/5th increments. The other noises have all values (0 to 255, because these are U8 textures) which means they work better at arbitrary opacities.

With all noises except the plus noise, as you smoothly increase the opacity, pixels will slowly start appearing. With the plus noise, as you smoothly increase the opacity, pixels will appear in large clumps, instead of appearing one by one. A way to deal with this could be to take a hint from stratified sampling, and instead of adding 1/10th to the noise to unbias it, you instead add a random number between 0 and 1/5th. It will still have the correct average, so won’t be biased, but the random numbers could break things up a bit. You could even use a blue noise texture as the source of those random numbers perhaps.

Here is a histogram of each noise, which shows what I’m talking about regarding the plus noise:

3×3 Region Analysis

I’ll only include IGN and the new ones. The rest of them can be found in the IGN LDG blog post at




The plus shaped noise does very poorly in these 3×3 regions, because there are only 5 possible values, and we are looking at how unique the values are from 9 different pixels. It definitely is not optimized for this usage case.

R2 does quite a bit better, but not as good as IGN, which makes sense because R2 is meant for “general purpose use” as a low discrepancy grid, where IGN is meant specifically to have great LDS properties in a 3×3 region.

Plus Shaped Region Analysis







In this test, taking plus shaped samples and analyzing how their values lie on a numberlines, white noise shows the worst results as per usual. Bayer and also blue noise don’t do that great either.

Now, unlike the last test, where IGN beat R2, we can see that R2 beats IGN. This shows again that R2 is good in “general purpose uses” where IGN is optimized towards just 3×3 blocks.

Lastly, we see the plus noise doing the best here – in the situation it was optimized for, which is no surprise. Any randomization added to this noise to help break up the quantization artifacts will make this specific test have a higher standard deviation of distances. With good noise (like blue noise?) used to jitter, the standard deviation may only go up a little. Having the standard deviation go up a little bit probably would help results in general when using this noise. After all, the goal of low discrepancy sequences is to have LOW discrepancy (discrepancy being some variance in the spacing here) but not NO discrepancy, since having no discrepancy is regularly spaced sampling, which has some bad properties, including aliasing.

Plus Shaped Noise vs IGN

Jorge (maker of IGN) derived the same plus shaped sampling noise that I did (I did my derivation after he said he had found such a noise, and then we compared to see if we found the same thing). He put the noise through some tests, using a plus shaped neighborhood sampling TAA implementation and he found that IGN performed better than this plus shaped sampling noise. I’m not sure the details of his test, or how much better IGN did, but it would be interesting to do some analysis and share those details. I may do that at some point, but if someone else does it first, please share! I’m curious if the problems came up due to the discretized values of this plus noise, and if jittering the values using good noise helps the problems at all.

You might be wondering how IGN is fully floating point when the plus noise is discretized.

If we tried to derive IGN the same way as we did with the plus noise, you would want to make every 3×3 block of pixels to have every value 0/9, 1/9, … 8/9, even overlapping ones. If you work through this generalized sudoku, you’ll find that there are too many constraints and it actually isn’t solvable. A way to get around this is to have some numerical drift of the values over space, so that you spread the error of it not being solvable over distance. That is what IGN does and is why it isn’t a discretized noise, having only x/9 values. I’m not sure if IGN optimally distributes this error evenly over distance or not though. That would be an interesting thing to look at.


Hopefully you found this post interesting and have some new tools in your toolbelt.

If you ever need VECTOR valued noise but only have SCALAR valued noise, you can try putting your scalar values through a Hilbert curve to turn your scalars into vectors. In my experience, this isn’t as high quality as having true vector valued noise, but does actually work in preserving the scalar noise properties in the resulting vector valued noise somewhat so is a lot better than nothing.

If you try using this noise or try out any of the things mentioned above or similar, it would be great to hear how it goes for you, either here as a comment or on twitter at

Thanks for reading!

Distance Between Points on a “Toroidal Disk”

Life update: I’m now doing applied graphics research at EA SEED and really digging it. Really great people, great work to be done, and lots of freedom to do it. I’m really stoked, to be honest.

In a previous post, I wrote about how to calculate the distance between two points in a rectangle, where the rectangle edges wrapped around. That is, if you leave the rectangle by going past the right edge, you’d end up coming out of the left edge. If you go past the top edge, you’d come out of the bottom edge.

Calculating the Distance Between Points in “Wrap Around” (Toroidal) Space

This weird space can be thought of as a toroid, or a doughnut. If you start at a point on a doughnut and go “upwards” on the surface, you’ll circle around back to where you started. Similarly, going left, you’ll circle around the doughnut and come back to where you started. More technically, this space is a “flat toroid” though, so isn’t quite the same thing as a doughnut. Video games have used this space for their game worlds, such as the classic “Asteroids”.

How was collision detection done on the Asteroids arcade game? -  Retrocomputing Stack Exchange
Asteroids: One of the first of many games to take place on the surface of a doughnut

I was recently experimenting with something (the experiment failed unfortunately ☹️) that involved working in a disk where going to the edge of the disk would teleport you to the center of the disk, and going to the center of the disk would teleport you to the edge of the disk. It’s like if you squished a doughnut flat to make the center hole infinitely small, and you removed the back side of the doughnut.

As part of this work, I needed to be able to calculate the distance between two points in this strange space.

It was a fun problem, so maybe you want to give it a shot before i explain how I did it. I’d love to hear what you come up with, either here as a comment on this post, or on twitter at

My Solution

Ok so I knew that there would be a few ways to get from a point A to a point B in this toroidal disk, and that we’d want to take the shortest of these paths as the length between them.

One path would be the line in the circle between the points.

Another path would be from point A to the edge of the circle, to get to the center, and from the center to point B. It’s interesting to note that these lines don’t have to be parallel!

Another path would be from point A to the center of the circle to get to the edge, and from the edge to point B.

Now for some stranger cases…

You could go from point A to the edge, to the center of the circle, then back to the edge at a different location and to point B.

You could also go form point A to the center, change direction and go to point B. This last case could only be as short as the “interior” path at minimum, so isn’t really a case we have to consider, but I’m including it for completeness.

Ok so we could turn this into code with a bunch of if statements to handle each case, but we can simplify this logic quite a bit.

First up, we do have to calculate the distance between the points in the circle the normal way, there’s no avoiding that. We’ll call that the “interior distance”.

Next, think about the paths point A can take to point B that aren’t through the disk. It can either go to the edge or the center, and we’ll want to keep whichever is shorter as the distance for the first part of the “exterior distance”. Since we want to take the shortest path to the center or the edge, we can just use the point’s radius as the distance form the center and 1.0 – radius as the distance from the edge (assuming a radius 1 circle). We’ll take the minimum distance between those two as the first part of the exterior distance path.

Next, it doesn’t matter if point A went to the center or the edge, the path can come out of either the center or edge and continue to point B. So, we again want the minimum of the distances: point B to the edge of the disk, or point B to the center of the disk. once again it’s the minimum of the radius and 1.0 – radius.

Adding those two distances together we get the “exterior distance”

The “Exterior Distance” Is min(A1, A2) + min(B1, B2)

The final answer we return as the distance between point A and B is whichever is smaller: the interior or exterior distance.

A fun thing is that while we have been working in 2 dimensions, this actually works for any dimension.

Here is some C++ code to calculate the distance:

// walking past the end of the circle brings you back to the center, and vice versa
// Assumes your points are in [0,1)^N with a disk center of (0.5, 0.5, ...)
template <size_t N>
float DistanceUnitDiskTorroidal(const std::array<float, N>& v1, const std::array<float, N>& v2)
    // Calculate the distance between the points going through the disk.
    // This is the "internal" distance.
    float distanceInternal = Distance(v1, v2);

    // The external distance is the distance between the points if going through the center
    // or past the edges.
    // This is the sum of the distance between each point and either the circle edge or the
    // circle center, whichever is closer.
    float distanceExternal1 = Length(v1 - 0.5f);
    distanceExternal1 = std::min(distanceExternal1, 0.5f - distanceExternal1);

    float distanceExternal2 = Length(v2 - 0.5f);
    distanceExternal2 = std::min(distanceExternal2, 0.5f - distanceExternal2);

    float distanceExternal = distanceExternal1 + distanceExternal2;

    // return whichever is less, between the internal and external distance.
    return std::min(distanceInternal, distanceExternal);

Again, I’d love to hear your thoughts, or any alternative methods you may have come up with, either here or on twitter at Thanks for reading!


On twitter, mentioned that it would be nice to see the distance from one point to the rest of them visualized, to get a better sense of how this distance function worked. That was a great idea so i made an interactive shadertoy:

Here are a few images. The blue dot is the point that the distances are calculated from. Red is no distance, Green is full distance.

Interleaved Gradient Noise: A Different Kind of Low Discrepancy Sequence

The python code that goes along with this post can be found at

In 2014, Jorge Jimenez from Activision presented a type of noise optimized for use with Temporal Anti Aliasing called Interleaved Gradient Noise or IGN ( This noise helps the neighborhood sampling history rejection part of TAA be more accurate, allowing the render to be closer to ground truth. IGN was ahead of it’s time. It still isn’t as well known or understood as it should be, and it shows the way for further advancements.

IGN can be used whenever you need a per pixel random number in rendering, and in this post we’ll compare and contrast IGN against three of its cousins: white noise, blue noise and Bayer matrices. Below are the 16×16 textures that we’ll be using for comparisons in this post.

For a first comparison, let’s look at the histograms of each texture. There are 256 pixels and the histogram has 256 buckets.

IGN is made by plugging the integer x and y pixel coordinates into a function and gives a floating point value out. It has a fairly uniform histogram. White noise is floating point white noise and has a fairly uneven histogram. Blue noise was made with the void and cluster algorithm, stored in a U8 texture, and has a perfectly uniform histogram – all 256 values are present in the 16×16 texture. Bayer also has all 256 values present in the texture.

Here is C++ code for calculating IGN:

float IGN(int pixelX, int pixelY)
    return std::fmodf(52.9829189f * std::fmodf(0.06711056f*float(pixelX) + 0.00583715f*float(pixelY), 1.0f), 1.0f);

How Is IGN Low Discrepancy?

An informal definition of low discrepancy is that the density of points in an area is close to the amount of area divided by the number of points. That is, if you had 10 points, you’d expect every 1/10th section of the area to have one point in it, and you’d expect all 3/10th sections to have 3 points. An important note is that low discrepancy sequences want LOW discrepancy, but not zero discrepancy. Check out wikipedia for a more formal explanation:

Evenly distributed samples are good for sampling, and thus numerical integration. Imagine you had a photograph and you wanted to calculate the brightness of the photo by taking 10 sample points and averaging them, instead of averaging all of the pixels. If your sample points clumped together in a few spots, your average will likely be too bright or too dark. If your points are evenly spaced all over the image, your average is more likely to be more accurate.

Zero discrepancy is regular sampling though, which can resonate with patterns in the data and give biased results. Low discrepancy avoids that, while still gaining benefits of being fairly evenly distributed.

IGN is low discrepancy in a different sort of way. If you look at any 3×3 block of pixels, even overlapping ones, you will find that the 9 values roughly match all values 0/9, 1/9, 2/9, … , 8/9, but that they are a bit randomized from the actual values. Every 3×3 block of pixels makes a low discrepancy set on the 1D number line.

Let’s pick a couple blocks of pixels and look at the distance between values in those pixels. First is IGN, which has a very low, and constant, standard deviation. The values are well spaced.

Here is white noise which has clumps and voids so has very high variance in distance between values:

Here is blue noise which does a lot better than white noise, but isn’t as good as IGN.

Lastly here is Bayer which is better than white noise, but is still pretty clumpy.

How Does IGN Make TAA Work Better?

TAA, or temporal anti aliasing, tries to make better renders for cheaper by amortizing rendering costs across multiple frames. Why take 10 samples in 1 frame, when you can take 1 sample for 10 frames and combine them?

The challenge in TAA is that objects are often moving, and so is the camera. You can use the current frame’s camera matrix, the previous frame’s camera matrix, and motion vectors to try and map pixels between frames (called temporal reprojection) but there are times when objects become occluded, or similar events that cause the found history to actually be invalid. If you don’t handle these cases and throw out the invalid history, you get ghosting where pixels use invalid history.

A common way to handle the problem of ghosting is to make a minimum and maximum RGB color cube of the 3×3 neighboring pixel colors for the current frame of a pixel, and clamp the previous frame’s pixel color to be inside of that box. The clamping makes any history which is too different be much closer to what is expected. The previous frame’s clamped pixel color is then linearly interpolated towards the current frame’s pixel color by a value such as 0.1. That leaky integration is called “Exponential Moving Average” which allows a running average that forgets old samples over time, without having to store the previous samples.

A great read for more details on TAA is “A Survey of Temporal Antialiasing Techniques” by Yang et al:

So where does IGN come in?

When TAA samples the 3×3 neighborhood, the intent is to get an idea of what possible colors the pixel should be able to take, based on the other other pixels in the local area. The more this neighborhood accurately represents the possible values of pixels in this local area, the more accurate the color clipping history rejection will be. IGN makes the local area more accurately represent the full set of possibilities in small neighborhoods of pixels.

For instance, let’s say you had a bright magenta object in front of a dark green forest background, and you were using stochastic alpha to make the magenta object be semi transparent. That is, the bright magenta object may have an opacity of 0.1111… (1/9) so using a random number per pixel in this object, you’d let 1/9th of the pixels be written to the screen, while 8/9ths of them would be discarded.

Ideally, you’d want every 3×3 block of pixels in this magenta object to have a single magenta pixel surviving the stochastic alpha test so that the neighborhood sampling would see that magenta was a possibility, and to keep the previous pixel’s history instead of rejecting it, allowing the pixel to converge to 1/9th transparency better.

With white noise random numbers, you would end up with clumps of magenta pixels and voids where they should be but aren’t. This makes TAA reject history more often than it should, making for a worse, less converged result.

With IGN, every 3×3 block of pixels (even overlapping blocks) has a low discrepancy set of scalar values, so you can expect that out of every 3×3 block of 9 pixels, that 1 pixel will survive the stochastic alpha test. This is how IGN improves rendering under TAA.

Blue noise sort of has this property, but not as much as IGN does. Bayer looks like it has this property but the regular grid of the result isn’t good for diagonal distances, while also looking more artificial.

Stochastic transparency using various per pixel noise types. The object has 1/9th transparency. Friends don’t let friends use white noise!

In other situations where you need a per pixel random number, results like the above will normally hold as well (small regions of pixels will more accurately represent all the possibilities), this isn’t limited to stochastic alpha.

Derivation Of IGN And Extensions

If you were to sit down to make IGN you might define your constraints as: “Every 3×3 block of pixels in an infinite texture should have the values 1 through 9”. At this point, you’ve basically described sudoku. If you then go on to add “Also, this should include OVERLAPPING blocks”, you’ve made a generalized sudoku. It turns out this is too many conflicting constraint and is not solvable. A way to get around this problem would be to put a little bit of drift in the numbers over space so that it was mostly solved and the error of the imperfect solution was distributed over space. At this point, you have reached how IGN works.

I asked Jorge how he made IGN and it turned out to involve spending a full 8 hour day (or was it longer? I forget!) sitting at a computer tweaking constants by hand until they had the properties he was looking for. That is some serious dedication!

Much like in the spatiotemporal blue noise work ( you might be wondering how to animate IGN over time. Jorge found a way to scroll the IGN over time to make individual pixels have better sampling over time, while still being perfectly good IGN over space. You scroll the texture by 5.588238 pixels each frame:

float IGN(int pixelX, int pixelY, int frame)
    frame = frame % 64; // need to periodically reset frame to avoid numerical issues
    float x = float(pixelX) + 5.588238f * float(frame);
    float y = float(pixelY) + 5.588238f * float(frame);
    return std::fmodf(52.9829189f * std::fmodf(0.06711056f*float(x) + 0.00583715f*float(y), 1.0f), 1.0f);

If you are wondering how you might be able to make vector valued IGN, we did that in our spatiotemporal blue noise work by putting the scalar IGN values through a Hilbert curve. The scalar value was multiplied (and rounded) to make an integer index, and that was put into the Hilbert curve to make a vector out. When we used those vectors for rendering, the resulting noise in the render was very close to scalar IGN. There are probably other methods, but this ought to be a good starting point.

Proposed Terminology: Low Discrepancy Grids

Low discrepancy sequences are ordered sequences of scalar or vector values. They are a function that looks like the below, with index being an integer, and value being a vector or a scalar:

\mathbf{\text{value}} = f(\text{index})

Or in C++:

std::vector<float> LowDiscrepancySequence(int index);

IGN works differently though. You plug in an integer x and y pixel coordinate and it gives you a floating point scalar value.

\text{value} = f(\text{Pixel}_{xy})

Or in C++:

float IGN(int pixelX, int pixelY);

Low discrepancy sequences are in contrast to Low discrepancy sets. Sequences have an order, and taking any number of the values starting at index 0 will also be low discrepancy. Low discrepancy sets don’t have an order, and should only be expected to be low discrepancy if all the values in the set are considered together.

Other terminology calls low discrepancy sequences “progressive” and low discrepancy sets “non progressive”.

So what should we call IGN or similar noise functions that take in a multi dimensional integer index and spit out a scalar value? There is definitely an ordering, so it isn’t a set, but the ordering is 2 dimensional and there really isn’t a starting location, since negative numbers work just as well as positive numbers in the formula.

I propose we should refer to them as low discrepancy grids. That would cover the various types of grids: regular, irregular, skewed, curvilinear, and beyond, and these in any dimension. IGN itself more specifically would be a low discrepancy regular grid, or a low discrepancy cartesian grid.


Related wikipedia pages:


Interleaved Gradient Noise is a very interesting noise pattern for use with per pixel random numbers, optimized towards neighborhood sampling rejection based TAA.

Even though it isn’t as widely known or understood as it should be, a secondary value to this work is showing that per pixel random numbers / sampling patterns can be generated for specific needs with great success.

This concept, along with the importance sampled vector valued spatiotemporal blue noise work recently put out are just two instances of this more general concept, and I believe they are just the beginning of other things yet to be created.

Why Can’t You Design Noise in Frequency Space?

The python code that goes along with this blog post can be found at

To evaluate the quality of a blue noise texture, you can analyze it in frequency space by taking a discrete Fourier transform. What you want to see is something that looks like tv static (white noise) with a darkened center, like the below. The frequencies in the center are the low frequencies, while the frequencies towards the edges are the high frequencies. This DFT shows high frequency randomness without any low frequency content, which is what blue noise is.

Left: blue noise texture. Right: discrete Fourier transform of that texture, showing a blue noise spectrum.

A common question then is: why can’t you just make what you want in frequency space and do an inverse Fourier transform to get the noise out you want? This could let you make all sorts of custom crafted types of noise, not just spatial blue noise.

Let’s try that out in 1D and see what happens.

First we make N complex values from polar coordinates that have a random angle 0 to 2pi and a random radius from 0 to 1. These will be the frequencies for our N noise values. We also want to make sure that the + and – frequency bins are complex conjugates of each other so that when we do an IDFT, we’ll get a strictly real valued signal.

After initializing these frequencies to white noise, we’ll multiply the values by a gaussian kernel to make the values towards the edges smaller. This is a low pass filter since the higher frequencies are reduced and the lower frequencies are mostly left alone. At this point, an IDFT would give us low frequency red noise, so we subtract these frequencies from the original white noise initialized frequencies. This is a high pass filter because the higher frequencies are left alone, while the low frequencies are reduced. At this point, an IDFT would give us high frequency blue noise. (There are a couple other things done, like setting the 0hz DC frequency bucket to a specific value. Check out the python code for more details.) Here is what we get if we do this for 64 noise values (N = 64):

Let’s see how this compares to 64 blue noise values made with the void and cluster algorithm:

The frequencies in the DFTs (right) look pretty similar but the histogram (2nd from right) from the void and cluster algorithm are much more uniform, and the values (3rd from right) look a lot more even. The output of the IDFT actually gave us the “raw values” shown 2nd from left in the first image, which are out of the [0,1] range, but are scaled and shifted to make the “normalized values” shown next to it.

Let’s look at the histogram and DFT for each at 10,240 samples. First is the IDFT method, then void and cluster generated blue noise.

So interestingly, the IDFT method makes noise that is gaussian distributed. This kind of makes sense because we are filling out frequencies as uniform random white noise, which are turning into uniform random white noise sinusoids that are being summed together, which will tend towards a gaussian distribution as you sum up more of them. In contrast, the void and cluster method makes uniform distributed values which are perfectly uniform.

The other interesting difference is that the IDFT method has frequency magnitudes very closely matching a gaussian, while the void and cluster algorithm has distinct valleys. I’ve seen these valleys show up as ripples in DFTs like the below, for 2D blue noise DFTs. It’s unclear to me if the ripples add value to the noise or if they are an imperfect artifact, but seeing as we often see these ripples in DSP (like with sinc), it’s my guess that these probably do add value, but I can’t quantify it.

Most blue noise textures are uniform distributed (we recently put out some work showing how to make them non uniform distributed though: but if you wanted gaussian distributed blue noise for some reason, maybe this IDFT method would work well for you? Hard to say but it could be interesting to try it out.

This is ultimately what the problem with the IDFT method is though… you get gaussian distributed values, not uniform, and the noise seems to be lower quality as well. If these issues could be solved, or if this noise has value as is, I think that’d be a real interesting and useful result. It would be interesting to then take this to vector valued masks and see if the same could be done there (check out my last post for more info:

The way I made noise through IDFT may be completely different than what you have in mind, and if so, you may get very different results. I’d love to hear any thoughts. I’m on twitter as @Atrix256.

I wonder if doing gradient descent on histograms and frequency phases could make uniform distributions and higher quality noise? Also, while there is importance in blue noise being actual blue noise (high frequency, better perceptually and designed to be removed with a gaussian blur), there is also importance in the fact that neighboring pixel values are very different from each other. I haven’t seen any methods for generating blue noise that were based on (anti)correlation but I would bet there’s a method waiting to be found there. If you do an auto correlation of a blue noise texture, it shows that pixels have anti correlation with their neighbors, and slight correlation with the neighbors of their neighbors, and even slighter anti correlation with the neighbors of those neighbors and so on. The ripple goes flat pretty quickly, so maybe an algorithm to satisfy those constraints wouldn’t be that difficult or have that long of a run time?

More Data

Here are some more comparisons of (1) Void and Cluster blue noise, (2) IDFT high pass filtered white noise to make blue noise, (3) IDFT low pass filtered white noise to make red noise. We’ll compare 8 values, 16, 32, 64, 128 and 256.

8 values:

16 values:

32 values:

64 values:

128 values:

256 values: