Calculating the Distance Between Points in “Wrap Around” (Toroidal) Space

Let’s say you are trying to find the distance between two points in 2D, but that these points are in a universe that “wraps around” like old video games – leaving the screen on the right, left, top or bottom side makes you re-appear on the opposite edge.

This universe is actually shaped like a toroid, also known as a doughnut. It’s actually an impossible object, a “flat torus”, so not exactly a doughnut, but whatever.

If you imagine yourself on the surface of a doughnut, it would behave exactly this way. If you go “down” you end up where you previously considered “up”. If you go far enough “left” you end up where you previously considered “right”.

How would you calculate the distance between two points in a universe like this?

Let’s imagine the situation below where we are trying to find the distance between the red point and the green point:

One way to do this would be to pick one of the points (I’m picking red in this case) and clone it 8 times to surround the cell like the below. You’d calculate the distance from the green point to each of the 9 red points, and whatever distance was smallest would be the answer.

Something not so desirable about this is that it takes 9 distance calculations to find the minimum distance. You can work with squared distances instead of regular distances to avoid a square root on each of these distance calculations, but that’s still a bit of calculation to do.

Going up in dimensions makes the problem even worse. In 3D, it requires 27 distance calculations to find the shortest point, and 81 distance calculations in 4D!

Luckily there’s a better way to approach this.

Let’s say that our universe (image) is 1 unit by 1 unit big (aka we are working in texture UVs). If you look at the image with 9 copies of the red dot, you can see that they are just the 9 possible combinations of having -1, +0, +1 on each axis added to the red dot’s coordinates. All possible combinations of the x and y axis having -1, +0 or +1 added to them are valid locations of the red dot.

Looking at the distance formula we can see that if we minimize each axis individually, that we will also end up with the minimal distance overall.

$d = \sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$

So, the better way is to minimize each axis individually.

On the x axis you’d find if the x axis distance between the red and green point is minimal when you subtract 1 from the red dot’s x axis position, leave it alone, or add 1.

Whichever x axis value of the red dot gives you the minimal x axis 1D distance is the x axis location to use.

You’d repeat for the y axis to get the y axis location to use (and would repeat for any further axes for higher dimensions).

This gives you the closest point which you can then plug into the distance formula to get the distance between the points in this wrap around space.

You can actually do better though.

Still working on each axis individually, you can calculate the absoluate value of the 1D distance between the two points on that axis. If that distance is greater than 0.5, the real distance for that axis is 1-distance.

The intuition here is that if you are in a 1d repeating space, if going from A to B is more than half the distance, it means that you went the wrong way, and that going the other way is shorter. The distance of that other way is one minus whatever distance you just calculated since the distance from one point to itself is 1!

Do that for each axis and use those 1d distances in the distance formula to get the actual distance.

This lets you minimize the distance without having to explicitly figure out which combination makes the point closest.

More importantly, it lets you efficiently calculate the distance between the two points in toroidal space (doughnut space!)

The computational complexity is a lot better. It’s now linear in the number of dimensions: $O(N)$ , instead of $O(3^N)$ .

Here is some C++ to show you how it would work in 2D.

float ToroidalDistance (float x1, float y1, float x2, float y2)
{
    float dx = std::abs(x2 - x1);
    float dy = std::abs(y2 - y1);

    if (dx > 0.5f)
        dx = 1.0f - dx;

    if (dy > 0.5f)
        dy = 1.0f - dy;

    return std::sqrt(dx*dx + dy*dy);
}

I hit this problem trying to make a tileable texture. I needed to place a few circles on a texture such that the circles weren’t too close to each other, even when the texture was tiled.

The calculations above gave me the basic tool needed to be able to calculate distances between points. Subtracting circle radii from the distance between points let me get toroidal distance between circles and make sure I didn’t place them too closely to each other.

That let me make an image that kept the distance constraints even when it was tiled.

Here’s an example image by itself:

Here is the image tiled:

Half Tile Offset Streaming World Grids

A number of years ago I worked on an open world game that got cancelled. Despite it not being released, I learned a few things.

If you try to make a game where the player can walk around a large world without loading screens, chances are that the whole world won’t fit in memory at once.

Since it can’t all fit in memory at once, as the player moves around you are going to have to unload old chunks of the world to make room for the new chunks that the player is about to enter.

To simplify the problem, it’s really common to break the world up into a grid, and keep a radius of tiles loaded around the player at any one time.

In the above you can see that 9 tiles are kept in memory. If the player crosses the boundary to the cell to the left, we unload the cells on the right (red cells) and load new cells in on the left (green cells).

The idea is that we keep a border of cells loaded around the player at all times so that they never see the edge of the world, but we don’t have to keep the whole world in memory at the same time.

The size of the cells can vary depending on the actual needs of your game. If you can travel very quickly in your game, you may need larger cells.

Instead of just having 9 large cells loaded at any one time, you may instead opt to have smaller cells but have more layers of them loaded at any one time. The below has 2 layers of cells loaded around the player, so has to keep 5×5 = 25 cells loaded at any one time. This can give you more granularity if you need it.

The number of cells you have to keep in memory when keeping N layers of cells loaded around the player is $(2N+1)^2$ .

1 layer means 9 cells, 2 layers mean 25 cells, 3 layers mean 49 cells, 4 layers mean 81 cells and so on.

This isn’t the only way to arrange your world tiles though. You can also make it so every row of tiles is offset by half a tile from the one above it. That gives you a setup like this:

With this setup, keeping a single layer of cells loaded around the player takes only 7 cells instead of 9. That might not sound like much, but that means that your memory budget is 129% of what it was the other way.

Alternately, it means you can keep your cells at the same quality level but only have to load 78% as much stuff from disk as the player moves around the world.

For keeping N layers of cells around the player you need to keep $3N^2+3N+1$ cells in memory.

1 layer means 7 cells, 2 layers mean 19 cells, 3 layers mean 37 cells, 4 layers mean 61 cells.

Here’s a table to show you how the regular grid compares to the half offset grid for cell counts. The savings do get better with more layers, but not very quickly.

$\begin{array}{c|c|c|c} \text{Layers} & \text{Regular Grid} & \text{Half Offset Grid} & \text{Size} \\ \hline 1 & 9 & 7 & 77.8\%\\ 2 & 25 & 19 & 76\%\\ 3 & 49 & 37 & 75.5\%\\ 4 & 81 & 61 & 75.3\%\\ \end{array}$

Overall, this is a pretty cool technique that is pretty low cost if you do this early in the project. Once you have a lot of content divided into a regular grid, it can be a challenge to move over to this half tile offset grid.

Some Notes From Readers

@chrispewebb said that an issue he’s faced when going this method is having T junctions on LODed terrain, but that skirts should be able to help there.

@runevision pointed out that while the memory requirements are lowered, so is the shortest distance (radius) from the player to data that isn’t loaded. One idea to deal with this if it’s a problem could be to use smaller cell sizes and to do more layers to make up for it.

A Tool To Debug Teams (Knoster)

In the professional world, programmers work in teams as a rule, with very few exceptions.

For the programmers aiming to remain programmers, and not going into management, we are often focused on our specific trade or area of expertise though, and so we spend less time learning about or thinking about what makes a team successful.

We learn some from personal experience – realizing that certain things are bad for a team many times by seeing the failures manifest in front of us – but we are definitely more likely to pick up a book on algorithms than we are a book on team management.

My mother in law is the opposite however, as part of what she does is mentor people to being leaders of teams and large organizations, and also consults to organizations in the field of education to fix budgetary and organizational problems they may be having.

She showed me an interesting chart the other day that is really eye opening. It’s a formalized look at how to identify some things that may be going wrong with a team.

The chart itself is from the educational sector (Tim Knoster in ~1990), and is meant to be used to “Manage Complex Change”, but looking at it, and having been a professional programmer for 16 years, it is definitely applicable to any team.

The chart is valuable whether you are leading a team, part of a team, or observing a team you are not a part of.

How you use this chart is you look on the right side to see what sort of problems your team may be having: confusion, anxiety, resistance, frustration, or false starts.

From there you scan left until you find the black box. That box is the element missing which is causing the problem for the team.

That’s all there is to it, it’s pretty simple. It actually seems like pretty obvious stuff too in hindsight, but I wouldn’t have been able to formalize something like that.

Obviously not every situation can be boiled down into a simple chart like this, and there are variations of this chart including more or different rows and columns, but this is a good start at trying to “debug a team” to figure out the source of an issue.

Want more details? Here are some links:

http://www.belb.org.uk/downloads/rc_knoster_managing_complex_change.pdf

http://www.d11.org/LRS/PersonalizedLearning/Documents/KnosterMANAGINGCOMPLEXCHANGE.pdf

http://nebula.wsimg.com/90f9e490329402583fea599cad009bb0?AccessKeyId=8AAC8D005153628DDDFA&disposition=0&alloworigin=1

When Random Numbers Are Too Random: Low Discrepancy Sequences

Random numbers can be useful in graphics and game development, but they have a pesky and sometimes undesirable habit of clumping together.

This is a problem in path tracing and monte carlo integration when you take N samples, but the samples aren’t well spread across the sampling range.

This can also be a problem for situations like when you are randomly placing objects in the world or generating treasure for a treasure chest. You don’t want your randomly placed trees to only be in one part of the forest, and you don’t want a player to get only trash items or only godly items when they open a treasure chest. Ideally you want to have some randomness, but you don’t want the random number generator to give you all of the same or similar random numbers.

The problem is that random numbers can be TOO random, like in the below where you can see clumps and large gaps between the 100 samples.

For cases like that, when you want random numbers that are a little bit more well distributed, you might find some use in low discrepancy sequences.

The standalone C++ code (one source file, standard headers, no libraries to link to) I used to generate the data and images are at the bottom of this post, as well as some links to more resources.

What Is Discrepancy?

In this context, discrepancy is a measurement of the highest or lowest density of points in a sequence. High discrepancy means that there is either a large area of empty space, or that there is an area that has a high density of points. Low discrepancy means that there are neither, and that your points are more or less pretty evenly distributed.

The lowest discrepancy possible has no randomness at all, and in the 1 dimensional case means that the points are evenly distributed on a grid. For monte carlo integration and the game dev usage cases I mentioned, we do want some randomness, we just want the random points to be spread out a little more evenly.

If more formal math notation is your thing, discrepancy is defined as:
$D_{N}(P)=\sup _{{B\in J}}\left|{\frac {A(B;P)}{N}}-\lambda _{s}(B)\right|$

You can read more about the formal definition here: Wikipedia:
Equidistributed sequence

For monte carlo integration specifically, this is the behavior each thing gives you:

High Discrepancy: Random Numbers / White Noise aka Uniform Distribution – At lower sample counts, convergance is slower (and have higher variance) due to the possibility of not getting good coverage over the area you integrating. At higher sample counts, this problem disappears. (Hint: real time graphics and preview renderings use a smaller number of samples)
Lowest Discrepancy: Regular Grid – This will cause aliasing, unlike the other “random” based sampling, which trade aliasing for noise. Noise is preferred over aliasing.
Low Discrepancy: Low Discrepancy Sequences – In lower numbers of samples, this will have faster convergence by having better coverage of the sampling space, but will use randomness to get rid of aliasing by introducing noise.

Also interesting to note, Quasi Monte Carlo has provably better asymptotic convergence than regular monte carlo integration.

1 Dimensional Sequences

We’ll first look at 1 dimensional sequences.

Grid

Here are 100 samples evenly spaced:

Random Numbers (White Noise)

This is actually a high discrepancy sequence. To generate this, you just use a standard random number generator to pick 100 points between 0 and 1. I used std::mt19937 with a std::uniform_real_distribution from 0 to 1:

Subrandom Numbers

Subrandom numbers are ways to decrease the discrepancy of white noise.

One way to do this is to break the sampling space in half. You then generate even numbered samples in the first half of the space, and odd numbered samples in the second half of the space.

There’s no reason you can’t generalize this into more divisions of space though.

This splits the space into 4 regions:

8 regions:

16 regions:

32 regions:

There are other ways to generate subrandom numbers though. One way is to generate random numbers between 0 and 0.5, and add them to the last sample, plus 0.5. This gives you a random walk type setup.

Here is that:

Uniform Sampling + Jitter

If you take the first subrandom idea to the logical maximum, you break your sample space up into N sections and place one point within those N sections to make a low discrepancy sequence made up of N points.

Another way to look at this is that you do uniform sampling, but add some random jitter to the samples, between +/- half a uniform sample size, to keep the samples in their own areas.

This is that:

I have heard that Pixar invented this technique interestingly.

Irrational Numbers

Rational numbers are numbers which can be described as fractions, such as 0.75 which can be expressed as 3/4. Irrational numbers are numbers which CANNOT be described as fractions, such as pi, or the golden ratio, or the square root of a prime number.

Interestingly you can use irrational numbers to generate low discrepancy sequences. You start with some value (could be 0, or could be a random number), add the irrational number, and modulus against 1.0. To get the next sample you add the irrational value again, and modulus against 1.0 again. Rinse and repeat until you get as many samples as you want.

Some values work better than others though, and apparently the golden ratio is provably the best choice (1.61803398875…), says Wikipedia.

Here is the golden ratio, using 4 different random (white noise) starting values:

Here I’ve used the square root of 2, with 4 different starting random numbers again:

Lastly, here is pi, with 4 random starting values:

Van der Corput Sequence

The Van der Corput sequence is the 1d equivelant of the Halton sequence which we’ll talk about later.

How you generate values in the Van der Corput sequence is you convert the index of your sample into some base.

For instance if it was base 2, you would convert your index to binary. If it was base 16, you would convert your index to hexadecimal.

Now, instead of treating the digits as if they are $B^0$ , $B^1$ , $B^2$ , etc (where B is the base), you instead treat them as $B^{-1}$ , $B^{-2}$ , $B^{-3}$ and so on. In other words, you multiply each digit by a fraction and add up the results.

To show a couple quick examples, let’s say we wanted sample 6 in the sequence of base 2.

First we convert 6 to binary which is 110. From right to left, we have 3 digits: a 0 in the 1’s place, a 1 in the 2’s place, and a 1 in the 4’s place. $0*1 + 1*2 + 1*4 = 6$ , so we can see that 110 is in fact 6 in binary.

To get the Van der Corput value for this, instead of treating it as the 1’s, 2’s and 4’s digit, we treat it as the 1/2, 1/4 and 1/8’s digit.

$0 * 1/2 + 1 * 1/4 + 1 * 1/8 = 3/8$ .

So, sample 6 in the Van der Corput sequence using base 2 is 3/8.

Let’s try sample 21 in base 3.

First we convert 21 to base 3 which is 210. We can verify this is right by seeing that $0 * 1 + 1 * 3 + 2 * 9 = 21$ .

Instead of a 1’s, 3’s and 9’s digit, we are going to treat it like a 1/3, 1/9 and 1/27 digit.

$0 * 1/3 + 1 * 1/9 + 2 * 1/27 = 5/27$

So, sample 21 in the Van der Corput sequence using base 3 is 5/27.

Here is the Van der Corput sequence for base 2:

Here it is for base 3:

Base 4:

Base 5:

Sobol

One dimensional Sobol is actually just the Van der Corput sequence base 2 re-arranged a little bit, but it’s generated differently.

You start with 0 (either using it as sample 0 or sample -1, doesn’t matter which), and for each sample you do this:

Calculate the Ruler function value for the current sample’s index(more info in a second)
Make the direction vector by shifting 1 left (in binary) 31 – ruler times.
XOR the last sample by the direction vector to get the new sample
To interpret the sample as a floating point number you divide it by $2^{32}$

That might sound completely different than the Van der Corput sequence but it actually is the same thing – just re-ordered.

In the final step when dividing by $2^{32}$ , we are really just interpreting the binary number as a fraction just like before, but it’s the LEFT most digit that is the 1/2 spot, not the RIGHT most digit.

The Ruler Function goes like: 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, …

It’s pretty easy to calculate too. Calculating the ruler function for an index (starting at 1) is just the zero based index of the right most 1’s digit after converting the number to binary.

1 in binary is 001 so Ruler(1) is 0.
2 in binary is 010 so Ruler(2) is 1.
3 in binary is 011 so Ruler(3) is 0.
4 in binary is 100 so Ruler(4) is 2.
5 in binary is 101 so Ruler(5) is 0.
and so on.

Here is 1D Sobol:

Hammersley

In one dimension, the Hammersley sequence is the same as the base 2 Van der Corput sequence, and in the same order. If that sounds strange that it’s the same, it’s a 2d sequence I broke down into a 1d sequence for comparison. The one thing Hammersley has that makes it unique in the 1d case is that you can truncate bits.

It doesn’t seem that useful for 1d Hammersley to truncate bits but knowing that is useful info too I guess. Look at the 2d version of Hammersley to get a fairer look at it, because it’s meant to be a 2d sequence.

Here is Hammersley:

With 1 bit truncated:

With 2 bits truncated:

Poisson Disc

Poisson disc points are points which are densely packed, but have a minimum distance from each other.

Computer scientists are still working out good algorithms to generate these points efficiently.

I use “Mitchell’s Best-Candidate” which means that when you want to generate a new point in the sequence, you generate N new points, and choose whichever point is farthest away from the other points you’ve generated so far.

Here it is where N is 100:

2 Dimensional Sequences

Next up, let’s look at some 2 dimensional sequences.

Grid

Below is 2d uniform samples on a grid.

Note that uniform grid is not particularly low discrepancy for the 2d case! More info here: Is it expected that uniform points would have non zero discrepancy?

Random

Here are 100 random points:

Uniform Grid + Jitter

Here is a uniform grid that has random jitter applied to the points. Jittered grid is a pretty commonly used low discrepancy sampling technique that has good success.

Subrandom

Just like in 1 dimensions, you can apply the subrandom ideas to 2 dimensions where you divide the X and Y axis into so many sections, and randomly choose points in the sections.

If you divide X and Y into the same number of sections though, you are going to have a problem because some areas are not going to have any points in them.

@Reedbeta pointed out that instead of using i%x and i%y, that you could use i%x and (i/x)%y to make it pick points in all regions.

Picking different numbers for X and Y can be another way to give good results. Here’s dividing X and Y into 2 and 3 sections respectively:

If you choose co-prime numbers for divisions for each axis you can get maximal period of repeats. 2 and 3 are coprime so the last example is a good example of that, but here is 3 and 11:

Here is 3 and 97. 97 is large enough that with only doing 100 samples, we are almost doing jittered grid on the y axis.

Here is the other subrandom number from 1d, where we start with a random value for X and Y, and then add a random number between 0 and 0.5 to each, also adding 0.5, to make a “random walk” type setup again:

Halton

The Halton sequence is just the Van der Corput sequence, but using a different base on each axis.

Here is the Halton sequence where X and Y use bases 2 and 3:

Here it is using bases 5 and 7:

Here are bases 13 and 9:

Irrational Numbers

The irrational numbers technique can be used for 2d as well but I wasn’t able to find out how to make it give decent looking output that didn’t have an obvious diagonal pattern in them. Bart Wronski shared a neat paper that explains how to use the golden ratio in 2d with great success: Golden Ratio Sequences For Low-Discrepancy Sampling

This uses the golden ratio for the X axis and the square root of 2 for the Y axis. Below that is the same, with a random starting point, to make it give a different sequence.

Here X axis uses square root of 2 and Y axis uses square root of 3. Below that is a random starting point, which gives the same discrepancy.

Hammersley

In 2 dimensions, the Hammersley sequence uses the 1d Hammersley sequence for the X axis: Instead of treating the binary version of the index as binary, you treat it as fractions like you do for Van der Corput and sum up the fractions.

For the Y axis, you just reverse the bits and then do the same!

Here is the Hammersley sequence. Note we would have to take 128 samples (not just the 100 we did) if we wanted it to fill the entire square with samples.

Truncating bits in 2d is a bit useful. Here is 1 bit truncated:

2 bits truncated:

Poisson Disc

Using the same method we did for 1d, we can generate points in 2d space:

N Rooks

There is a sampling pattern called N-Rooks where you put N rooks onto a chess board and arrange them such that no two are in the same row or column.

A way to generate these samples is to realize that there will be only one rook per row, and that none of them will ever be in the same column. So, you make an array that has numbers 0 to N-1, and then shuffle the array. The index into the array is the row, and the value in the array is the column.

Here are 100 rooks:

Sobol

Sobol in two dimensions is more complex to explain so I’ll link you to the source I used: Sobol Sequences Made Simple.

The 1D sobol already covered is used for the X axis, and then something more complex was used for the Y axis:

Links

Bart Wronski has a really great series on a related topic: Dithering in Games

Wikipedia: Low Discrepancy Sequence

Wikipedia: Halton Sequence

Wikipedia: Van der Corput Sequence

Using Fibonacci Sequence To Generate Colors

Deeper info and usage cases for low discrepancy sequences

Poisson-Disc Sampling

Low discrepancy sequences are related to blue noise. Where white noise contains all frequencies evenly, blue noise has more high frequencies and fewer low frequencies. Blue noise is essentially the ultimate in low discrepancy, but can be expensive to compute. Here are some pages on blue noise:

Free Blue Noise Textures

The problem with 3D blue noise

Stippling and Blue Noise

Vegetation placement in “The Witness”

Here are some links from @marc_b_reynolds:
Sobol (low-discrepancy) sequence in 1-3D, stratified in 2-4D.
Classic binary-reflected gray code.
Sobol.h

Weyl Sequence

Code

#define _CRT_SECURE_NO_WARNINGS

#include <windows.h>  // for bitmap headers and performance counter.  Sorry non windows people!
#include <vector>
#include <stdint.h>
#include <random>
#include <array>
#include <algorithm>
#include <stdlib.h>
#include <set>

typedef uint8_t uint8;

#define NUM_SAMPLES 100  // to simplify some 2d code, this must be a square
#define NUM_SAMPLES_FOR_COLORING 100

// Turning this on will slow things down significantly because it's an O(N^5) operation for 2d!
#define CALCULATE_DISCREPANCY 0

#define IMAGE1D_WIDTH 600
#define IMAGE1D_HEIGHT 50
#define IMAGE2D_WIDTH 300
#define IMAGE2D_HEIGHT 300
#define IMAGE_PAD   30

#define IMAGE1D_CENTERX ((IMAGE1D_WIDTH+IMAGE_PAD*2)/2)
#define IMAGE1D_CENTERY ((IMAGE1D_HEIGHT+IMAGE_PAD*2)/2)
#define IMAGE2D_CENTERX ((IMAGE2D_WIDTH+IMAGE_PAD*2)/2)
#define IMAGE2D_CENTERY ((IMAGE2D_HEIGHT+IMAGE_PAD*2)/2)

#define AXIS_HEIGHT 40
#define DATA_HEIGHT 20
#define DATA_WIDTH 2

#define COLOR_FILL SColor(255,255,255)
#define COLOR_AXIS SColor(0, 0, 0)

//======================================================================================
struct SImageData
{
    SImageData ()
        : m_width(0)
        , m_height(0)
    { }
  
    size_t m_width;
    size_t m_height;
    size_t m_pitch;
    std::vector<uint8> m_pixels;
};

struct SColor
{
    SColor (uint8 _R = 0, uint8 _G = 0, uint8 _B = 0)
        : R(_R), G(_G), B(_B)
    { }

    uint8 B, G, R;
};

//======================================================================================
bool SaveImage (const char *fileName, const SImageData &image)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "wb");
    if (!file) {
        printf("Could not save %s\n", fileName);
        return false;
    }
  
    // make the header info
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
  
    header.bfType = 0x4D42;
    header.bfReserved1 = 0;
    header.bfReserved2 = 0;
    header.bfOffBits = 54;
  
    infoHeader.biSize = 40;
    infoHeader.biWidth = (LONG)image.m_width;
    infoHeader.biHeight = (LONG)image.m_height;
    infoHeader.biPlanes = 1;
    infoHeader.biBitCount = 24;
    infoHeader.biCompression = 0;
    infoHeader.biSizeImage = (DWORD) image.m_pixels.size();
    infoHeader.biXPelsPerMeter = 0;
    infoHeader.biYPelsPerMeter = 0;
    infoHeader.biClrUsed = 0;
    infoHeader.biClrImportant = 0;
  
    header.bfSize = infoHeader.biSizeImage + header.bfOffBits;
  
    // write the data and close the file
    fwrite(&header, sizeof(header), 1, file);
    fwrite(&infoHeader, sizeof(infoHeader), 1, file);
    fwrite(&image.m_pixels[0], infoHeader.biSizeImage, 1, file);
    fclose(file);
 
    return true;
}

//======================================================================================
void ImageInit (SImageData& image, size_t width, size_t height)
{
    image.m_width = width;
    image.m_height = height;
    image.m_pitch = 4 * ((width * 24 + 31) / 32);
    image.m_pixels.resize(image.m_pitch * image.m_width);
    std::fill(image.m_pixels.begin(), image.m_pixels.end(), 0);
}

//======================================================================================
void ImageClear (SImageData& image, const SColor& color)
{
    uint8* row = &image.m_pixels[0];
    for (size_t rowIndex = 0; rowIndex < image.m_height; ++rowIndex)
    {
        SColor* pixels = (SColor*)row;
        std::fill(pixels, pixels + image.m_width, color);

        row += image.m_pitch;
    }
}

//======================================================================================
void ImageBox (SImageData& image, size_t x1, size_t x2, size_t y1, size_t y2, const SColor& color)
{
    for (size_t y = y1; y < y2; ++y)
    {
        uint8* row = &image.m_pixels[y * image.m_pitch];
        SColor* start = &((SColor*)row)[x1];
        std::fill(start, start + x2 - x1, color);
    }
}

//======================================================================================
float Distance (float x1, float y1, float x2, float y2)
{
    float dx = (x2 - x1);
    float dy = (y2 - y1);

    return std::sqrtf(dx*dx + dy*dy);
}

//======================================================================================
SColor DataPointColor (size_t sampleIndex)
{
    SColor ret;
    float percent = (float(sampleIndex) / (float(NUM_SAMPLES_FOR_COLORING) - 1.0f));

    ret.R = uint8((1.0f - percent) * 255.0f);
    ret.G = 0;
    ret.B = uint8(percent * 255.0f);

    float mag = (float)sqrt(ret.R*ret.R + ret.G*ret.G + ret.B*ret.B);
    ret.R = uint8((float(ret.R) / mag)*255.0f);
    ret.G = uint8((float(ret.G) / mag)*255.0f);
    ret.B = uint8((float(ret.B) / mag)*255.0f);

    return ret;
}

//======================================================================================
float RandomFloat (float min, float max)
{
    static std::random_device rd;
    static std::mt19937 mt(rd());
    std::uniform_real_distribution<float> dist(min, max);
    return dist(mt);
}

//======================================================================================
size_t Ruler (size_t n)
{
    size_t ret = 0;
    while (n != 0 && (n & 1) == 0)
    {
        n /= 2;
        ++ret;
    }
    return ret;
}

//======================================================================================
float CalculateDiscrepancy1D (const std::array<float, NUM_SAMPLES>& samples)
{
    // some info about calculating discrepancy
    // https://math.stackexchange.com/questions/1681562/how-to-calculate-discrepancy-of-a-sequence

    // Calculates the discrepancy of this data.
    // Assumes the data is [0,1) for valid sample range
    std::array<float, NUM_SAMPLES> sortedSamples = samples;
    std::sort(sortedSamples.begin(), sortedSamples.end());

    float maxDifference = 0.0f;
    for (size_t startIndex = 0; startIndex <= NUM_SAMPLES; ++startIndex)
    {
        // startIndex 0 = 0.0f.  startIndex 1 = sortedSamples[0]. etc

        float startValue = 0.0f;
        if (startIndex > 0)
            startValue = sortedSamples[startIndex - 1];

        for (size_t stopIndex = startIndex; stopIndex <= NUM_SAMPLES; ++stopIndex)
        {
            // stopIndex 0 = sortedSamples[0].  startIndex[N] = 1.0f. etc

            float stopValue = 1.0f;
            if (stopIndex < NUM_SAMPLES)
                stopValue = sortedSamples[stopIndex];

            float length = stopValue - startValue;

            // open interval (startValue, stopValue)
            size_t countInside = 0;
            for (float sample : samples)
            {
                if (sample > startValue &&
                    sample < stopValue)
                {
                    ++countInside;
                }
            }
            float density = float(countInside) / float(NUM_SAMPLES);
            float difference = std::abs(density - length);
            if (difference > maxDifference)
                maxDifference = difference;

            // closed interval [startValue, stopValue]
            countInside = 0;
            for (float sample : samples)
            {
                if (sample >= startValue &&
                    sample <= stopValue)
                {
                    ++countInside;
                }
            }
            density = float(countInside) / float(NUM_SAMPLES);
            difference = std::abs(density - length);
            if (difference > maxDifference)
                maxDifference = difference;
        }
    }
    return maxDifference;
}

//======================================================================================
float CalculateDiscrepancy2D (const std::array<std::array<float, 2>, NUM_SAMPLES>& samples)
{
    // some info about calculating discrepancy
    // https://math.stackexchange.com/questions/1681562/how-to-calculate-discrepancy-of-a-sequence

    // Calculates the discrepancy of this data.
    // Assumes the data is [0,1) for valid sample range.

    // Get the sorted list of unique values on each axis
    std::set<float> setSamplesX;
    std::set<float> setSamplesY;
    for (const std::array<float, 2>& sample : samples)
    {
        setSamplesX.insert(sample[0]);
        setSamplesY.insert(sample[1]);
    }
    std::vector<float> sortedXSamples;
    std::vector<float> sortedYSamples;
    sortedXSamples.reserve(setSamplesX.size());
    sortedYSamples.reserve(setSamplesY.size());
    for (float f : setSamplesX)
        sortedXSamples.push_back(f);
    for (float f : setSamplesY)
        sortedYSamples.push_back(f);

    // Get the sorted list of samples on the X axis, for faster interval testing
    std::array<std::array<float, 2>, NUM_SAMPLES> sortedSamplesX = samples;
    std::sort(sortedSamplesX.begin(), sortedSamplesX.end(),
        [] (const std::array<float, 2>& itemA, const std::array<float, 2>& itemB)
        {
            return itemA[0] < itemB[0];
        }
    );

    // calculate discrepancy
    float maxDifference = 0.0f;
    for (size_t startIndexY = 0; startIndexY <= sortedYSamples.size(); ++startIndexY)
    {
        float startValueY = 0.0f;
        if (startIndexY > 0)
            startValueY = *(sortedYSamples.begin() + startIndexY - 1);

        for (size_t startIndexX = 0; startIndexX <= sortedXSamples.size(); ++startIndexX)
        {
            float startValueX = 0.0f;
            if (startIndexX > 0)
                startValueX = *(sortedXSamples.begin() + startIndexX - 1);

            for (size_t stopIndexY = startIndexY; stopIndexY <= sortedYSamples.size(); ++stopIndexY)
            {
                float stopValueY = 1.0f;
                if (stopIndexY < sortedYSamples.size())
                    stopValueY = sortedYSamples[stopIndexY];

                for (size_t stopIndexX = startIndexX; stopIndexX <= sortedXSamples.size(); ++stopIndexX)
                {
                    float stopValueX = 1.0f;
                    if (stopIndexX < sortedXSamples.size())
                        stopValueX = sortedXSamples[stopIndexX];

                    // calculate area
                    float length = stopValueX - startValueX;
                    float height = stopValueY - startValueY;
                    float area = length * height;

                    // open interval (startValue, stopValue)
                    size_t countInside = 0;
                    for (const std::array<float, 2>& sample : samples)
                    {
                        if (sample[0] > startValueX &&
                            sample[1] > startValueY &&
                            sample[0] < stopValueX &&
                            sample[1] < stopValueY)
                        {
                            ++countInside;
                        }
                    }
                    float density = float(countInside) / float(NUM_SAMPLES);
                    float difference = std::abs(density - area);
                    if (difference > maxDifference)
                        maxDifference = difference;

                    // closed interval [startValue, stopValue]
                    countInside = 0;
                    for (const std::array<float, 2>& sample : samples)
                    {
                        if (sample[0] >= startValueX &&
                            sample[1] >= startValueY &&
                            sample[0] <= stopValueX &&
                            sample[1] <= stopValueY)
                        {
                            ++countInside;
                        }
                    }
                    density = float(countInside) / float(NUM_SAMPLES);
                    difference = std::abs(density - area);
                    if (difference > maxDifference)
                        maxDifference = difference;
                }
            }
        }
    }

    return maxDifference;
}

//======================================================================================
void Test1D (const char* fileName, const std::array<float, NUM_SAMPLES>& samples)
{
    // create and clear the image
    SImageData image;
    ImageInit(image, IMAGE1D_WIDTH + IMAGE_PAD * 2, IMAGE1D_HEIGHT + IMAGE_PAD * 2);

    // setup the canvas
    ImageClear(image, COLOR_FILL);

    // calculate the discrepancy
    #if CALCULATE_DISCREPANCY
        float discrepancy = CalculateDiscrepancy1D(samples);
        printf("%s Discrepancy = %0.2f%%\n", fileName, discrepancy*100.0f);
    #endif

    // draw the sample points
    size_t i = 0;
    for (float f: samples)
    {
        size_t pos = size_t(f * float(IMAGE1D_WIDTH)) + IMAGE_PAD;
        ImageBox(image, pos, pos + 1, IMAGE1D_CENTERY - DATA_HEIGHT / 2, IMAGE1D_CENTERY + DATA_HEIGHT / 2, DataPointColor(i));
        ++i;
    }

    // draw the axes lines. horizontal first then the two vertical
    ImageBox(image, IMAGE_PAD, IMAGE1D_WIDTH + IMAGE_PAD, IMAGE1D_CENTERY, IMAGE1D_CENTERY + 1, COLOR_AXIS);
    ImageBox(image, IMAGE_PAD, IMAGE_PAD + 1, IMAGE1D_CENTERY - AXIS_HEIGHT / 2, IMAGE1D_CENTERY + AXIS_HEIGHT / 2, COLOR_AXIS);
    ImageBox(image, IMAGE1D_WIDTH + IMAGE_PAD, IMAGE1D_WIDTH + IMAGE_PAD + 1, IMAGE1D_CENTERY - AXIS_HEIGHT / 2, IMAGE1D_CENTERY + AXIS_HEIGHT / 2, COLOR_AXIS);

    // save the image
    SaveImage(fileName, image);
}

//======================================================================================
void Test2D (const char* fileName, const std::array<std::array<float,2>, NUM_SAMPLES>& samples)
{
    // create and clear the image
    SImageData image;
    ImageInit(image, IMAGE2D_WIDTH + IMAGE_PAD * 2, IMAGE2D_HEIGHT + IMAGE_PAD * 2);
    
    // setup the canvas
    ImageClear(image, COLOR_FILL);

    // calculate the discrepancy
    #if CALCULATE_DISCREPANCY
        float discrepancy = CalculateDiscrepancy2D(samples);
        printf("%s Discrepancy = %0.2f%%\n", fileName, discrepancy*100.0f);
    #endif

    // draw the sample points
    size_t i = 0;
    for (const std::array<float, 2>& sample : samples)
    {
        size_t posx = size_t(sample[0] * float(IMAGE2D_WIDTH)) + IMAGE_PAD;
        size_t posy = size_t(sample[1] * float(IMAGE2D_WIDTH)) + IMAGE_PAD;
        ImageBox(image, posx - 1, posx + 1, posy - 1, posy + 1, DataPointColor(i));
        ++i;
    }

    // horizontal lines
    ImageBox(image, IMAGE_PAD - 1, IMAGE2D_WIDTH + IMAGE_PAD + 1, IMAGE_PAD - 1, IMAGE_PAD, COLOR_AXIS);
    ImageBox(image, IMAGE_PAD - 1, IMAGE2D_WIDTH + IMAGE_PAD + 1, IMAGE2D_HEIGHT + IMAGE_PAD, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);

    // vertical lines
    ImageBox(image, IMAGE_PAD - 1, IMAGE_PAD, IMAGE_PAD - 1, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);
    ImageBox(image, IMAGE_PAD + IMAGE2D_WIDTH, IMAGE_PAD + IMAGE2D_WIDTH + 1, IMAGE_PAD - 1, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);

    // save the image
    SaveImage(fileName, image);
}

//======================================================================================
void TestUniform1D (bool jitter)
{
    // calculate the sample points
    const float c_cellSize = 1.0f / float(NUM_SAMPLES+1);
    std::array<float, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i] = float(i+1) / float(NUM_SAMPLES+1);
        if (jitter)
            samples[i] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);
    }

    // save bitmap etc
    if (jitter)
        Test1D("1DUniformJitter.bmp", samples);
    else
        Test1D("1DUniform.bmp", samples);
}

//======================================================================================
void TestUniformRandom1D ()
{
    // calculate the sample points
    const float c_halfJitter = 1.0f / float((NUM_SAMPLES + 1) * 2);
    std::array<float, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
        samples[i] = RandomFloat(0.0f, 1.0f);

    // save bitmap etc
    Test1D("1DUniformRandom.bmp", samples);
}

//======================================================================================
void TestSubRandomA1D (size_t numRegions)
{
    const float c_randomRange = 1.0f / float(numRegions);

    // calculate the sample points
    const float c_halfJitter = 1.0f / float((NUM_SAMPLES + 1) * 2);
    std::array<float, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i] = RandomFloat(0.0f, c_randomRange);
        samples[i] += float(i % numRegions) / float(numRegions);
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "1DSubRandomA_%zu.bmp", numRegions);
    Test1D(fileName, samples);
}

//======================================================================================
void TestSubRandomB1D ()
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    float sample = RandomFloat(0.0f, 0.5f);
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        sample = std::fmodf(sample + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
        samples[i] = sample;
    }

    // save bitmap etc
    Test1D("1DSubRandomB.bmp", samples);
}

//======================================================================================
void TestVanDerCorput (size_t base)
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i] = 0.0f;
        float denominator = float(base);
        size_t n = i;
        while (n > 0)
        {
            size_t multiplier = n % base;
            samples[i] += float(multiplier) / denominator;
            n = n / base;
            denominator *= base;
        }
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "1DVanDerCorput_%zu.bmp", base);
    Test1D(fileName, samples);
}

//======================================================================================
void TestIrrational1D (float irrational, float seed)
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    float sample = seed;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        sample = std::fmodf(sample + irrational, 1.0f);
        samples[i] = sample;
    }

    // save bitmap etc
    char irrationalStr[256];
    sprintf(irrationalStr, "%f", irrational);
    char seedStr[256];
    sprintf(seedStr, "%f", seed);
    char fileName[256];
    sprintf(fileName, "1DIrrational_%s_%s.bmp", &irrationalStr[2], &seedStr[2]);
    Test1D(fileName, samples);
}

//======================================================================================
void TestSobol1D ()
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    size_t sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        size_t ruler = Ruler(i + 1);
        size_t direction = size_t(size_t(1) << size_t(31 - ruler));
        sampleInt = sampleInt ^ direction;
        samples[i] = float(sampleInt) / std::pow(2.0f, 32.0f);
    }

    // save bitmap etc
    Test1D("1DSobol.bmp", samples);
}

//======================================================================================
void TestHammersley1D (size_t truncateBits)
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    size_t sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        size_t n = i >> truncateBits;
        float base = 1.0f / 2.0f;
        samples[i] = 0.0f;
        while (n)
        {
            if (n & 1)
                samples[i] += base;
            n /= 2;
            base /= 2.0f;
        }
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "1DHammersley_%zu.bmp", truncateBits);
    Test1D(fileName, samples);
}

//======================================================================================
float MinimumDistance1D (const std::array<float, NUM_SAMPLES>& samples, size_t numSamples, float x)
{
    // Used by poisson.
    // This returns the minimum distance that point (x) is away from the sample points, from [0, numSamples).
    float minimumDistance = 0.0f;
    for (size_t i = 0; i < numSamples; ++i)
    {
        float distance = std::abs(samples[i] - x);
        if (i == 0 || distance < minimumDistance)
            minimumDistance = distance;
    }
    return minimumDistance;
}

//======================================================================================
void TestPoisson1D ()
{
    // every time we want to place a point, we generate this many points and choose the one farthest away from all the other points (largest minimum distance)
    const size_t c_bestOfAttempts = 100;

    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    for (size_t sampleIndex = 0; sampleIndex < NUM_SAMPLES; ++sampleIndex)
    {
        // generate some random points and keep the one that has the largest minimum distance from any of the existing points
        float bestX = 0.0f;
        float bestMinDistance = 0.0f;
        for (size_t attempt = 0; attempt < c_bestOfAttempts; ++attempt)
        {
            float attemptX = RandomFloat(0.0f, 1.0f);
            float minDistance = MinimumDistance1D(samples, sampleIndex, attemptX);

            if (minDistance > bestMinDistance)
            {
                bestX = attemptX;
                bestMinDistance = minDistance;
            }
        }
        samples[sampleIndex] = bestX;
    }

    // save bitmap etc
    Test1D("1DPoisson.bmp", samples);
}

//======================================================================================
void TestUniform2D (bool jitter)
{
    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
    const float c_cellSize = 1.0f / float(c_oneSide+1);
    for (size_t iy = 0; iy < c_oneSide; ++iy)
    {
        for (size_t ix = 0; ix < c_oneSide; ++ix)
        {
            size_t sampleIndex = iy * c_oneSide + ix;

            samples[sampleIndex][0] = float(ix + 1) / (float(c_oneSide + 1));
            if (jitter)
                samples[sampleIndex][0] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);

            samples[sampleIndex][1] = float(iy + 1) / (float(c_oneSide) + 1.0f);
            if (jitter)
                samples[sampleIndex][1] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);
        }
    }

    // save bitmap etc
    if (jitter)
        Test2D("2DUniformJitter.bmp", samples);
    else
        Test2D("2DUniform.bmp", samples);
}

//======================================================================================
void TestUniformRandom2D ()
{
    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
    const float c_halfJitter = 1.0f / float((c_oneSide + 1) * 2);
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i][0] = RandomFloat(0.0f, 1.0f);
        samples[i][1] = RandomFloat(0.0f, 1.0f);
    }

    // save bitmap etc
    Test2D("2DUniformRandom.bmp", samples);
}

//======================================================================================
void TestSubRandomA2D (size_t regionsX, size_t regionsY)
{
    const float c_randomRangeX = 1.0f / float(regionsX);
    const float c_randomRangeY = 1.0f / float(regionsY);

    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i][0] = RandomFloat(0.0f, c_randomRangeX);
        samples[i][0] += float(i % regionsX) / float(regionsX);

        samples[i][1] = RandomFloat(0.0f, c_randomRangeY);
        samples[i][1] += float(i % regionsY) / float(regionsY);
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "2DSubRandomA_%zu_%zu.bmp", regionsX, regionsY);
    Test2D(fileName, samples);
}

//======================================================================================
void TestSubRandomB2D ()
{
    // calculate the sample points
    float samplex = RandomFloat(0.0f, 0.5f);
    float sampley = RandomFloat(0.0f, 0.5f);
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samplex = std::fmodf(samplex + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
        sampley = std::fmodf(sampley + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
        samples[i][0] = samplex;
        samples[i][1] = sampley;
    }
    
    // save bitmap etc
    Test2D("2DSubRandomB.bmp", samples);
}

//======================================================================================
void TestHalton (size_t basex, size_t basey)
{
    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
    const float c_halfJitter = 1.0f / float((c_oneSide + 1) * 2);
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        // x axis
        samples[i][0] = 0.0f;
        {
            float denominator = float(basex);
            size_t n = i;
            while (n > 0)
            {
                size_t multiplier = n % basex;
                samples[i][0] += float(multiplier) / denominator;
                n = n / basex;
                denominator *= basex;
            }
        }

        // y axis
        samples[i][1] = 0.0f;
        {
            float denominator = float(basey);
            size_t n = i;
            while (n > 0)
            {
                size_t multiplier = n % basey;
                samples[i][1] += float(multiplier) / denominator;
                n = n / basey;
                denominator *= basey;
            }
        }
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "2DHalton_%zu_%zu.bmp", basex, basey);
    Test2D(fileName, samples);
}

//======================================================================================
void TestSobol2D ()
{
    // calculate the sample points

    // x axis
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    size_t sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        size_t ruler = Ruler(i + 1);
        size_t direction = size_t(size_t(1) << size_t(31 - ruler));
        sampleInt = sampleInt ^ direction;
        samples[i][0] = float(sampleInt) / std::pow(2.0f, 32.0f);
    }

    // y axis
    // Code adapted from http://web.maths.unsw.edu.au/~fkuo/sobol/
    // uses numbers: new-joe-kuo-6.21201

    // Direction numbers
    std::vector<size_t> V;
    V.resize((size_t)ceil(log((double)NUM_SAMPLES) / log(2.0)));
    V[0] = size_t(1) << size_t(31);
    for (size_t i = 1; i < V.size(); ++i)
        V[i] = V[i - 1] ^ (V[i - 1] >> 1);

    // Samples
    sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i) {
        size_t ruler = Ruler(i + 1);
        sampleInt = sampleInt ^ V[ruler];
        samples[i][1] = float(sampleInt) / std::pow(2.0f, 32.0f);
    }

    // save bitmap etc
    Test2D("2DSobol.bmp", samples);
}

//======================================================================================
void TestHammersley2D (size_t truncateBits)
{
    // figure out how many bits we are working in.
    size_t value = 1;
    size_t numBits = 0;
    while (value < NUM_SAMPLES)
    {
        value *= 2;
        ++numBits;
    }

    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    size_t sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        // x axis
        samples[i][0] = 0.0f;
        {
            size_t n = i >> truncateBits;
            float base = 1.0f / 2.0f;
            while (n)
            {
                if (n & 1)
                    samples[i][0] += base;
                n /= 2;
                base /= 2.0f;
            }
        }

        // y axis
        samples[i][1] = 0.0f;
        {
            size_t n = i >> truncateBits;
            size_t mask = size_t(1) << (numBits - 1 - truncateBits);

            float base = 1.0f / 2.0f;
            while (mask)
            {
                if (n & mask)
                    samples[i][1] += base;
                mask /= 2;
                base /= 2.0f;
            }
        }
    }


    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "2DHammersley_%zu.bmp", truncateBits);
    Test2D(fileName, samples);
}

//======================================================================================
void TestRooks2D ()
{
    // make and shuffle rook positions
    std::random_device rd;
    std::mt19937 mt(rd());
    std::array<size_t, NUM_SAMPLES> rookPositions;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
        rookPositions[i] = i;
    std::shuffle(rookPositions.begin(), rookPositions.end(), mt);

    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        // x axis
        samples[i][0] = float(rookPositions[i]) / float(NUM_SAMPLES-1);

        // y axis
        samples[i][1] = float(i) / float(NUM_SAMPLES - 1);
    }

    // save bitmap etc
    Test2D("2DRooks.bmp", samples);
}

//======================================================================================
void TestIrrational2D (float irrationalx, float irrationaly, float seedx, float seedy)
{
    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    float samplex = seedx;
    float sampley = seedy;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samplex = std::fmodf(samplex + irrationalx, 1.0f);
        sampley = std::fmodf(sampley + irrationaly, 1.0f);

        samples[i][0] = samplex;
        samples[i][1] = sampley;
    }

    // save bitmap etc
    char irrationalxStr[256];
    sprintf(irrationalxStr, "%f", irrationalx);
    char irrationalyStr[256];
    sprintf(irrationalyStr, "%f", irrationaly);
    char seedxStr[256];
    sprintf(seedxStr, "%f", seedx);
    char seedyStr[256];
    sprintf(seedyStr, "%f", seedy);
    char fileName[256];
    sprintf(fileName, "2DIrrational_%s_%s_%s_%s.bmp", &irrationalxStr[2], &irrationalyStr[2], &seedxStr[2], &seedyStr[2]);
    Test2D(fileName, samples);
}

//======================================================================================
float MinimumDistance2D (const std::array<std::array<float, 2>, NUM_SAMPLES>& samples, size_t numSamples, float x, float y)
{
    // Used by poisson.
    // This returns the minimum distance that point (x,y) is away from the sample points, from [0, numSamples).
    float minimumDistance = 0.0f;
    for (size_t i = 0; i < numSamples; ++i)
    {
        float distance = Distance(samples[i][0], samples[i][1], x, y);
        if (i == 0 || distance < minimumDistance)
            minimumDistance = distance;
    }
    return minimumDistance;
}

//======================================================================================
void TestPoisson2D ()
{
    // every time we want to place a point, we generate this many points and choose the one farthest away from all the other points (largest minimum distance)
    const size_t c_bestOfAttempts = 100;

    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    for (size_t sampleIndex = 0; sampleIndex < NUM_SAMPLES; ++sampleIndex)
    {
        // generate some random points and keep the one that has the largest minimum distance from any of the existing points
        float bestX = 0.0f;
        float bestY = 0.0f;
        float bestMinDistance = 0.0f;
        for (size_t attempt = 0; attempt < c_bestOfAttempts; ++attempt)
        {
            float attemptX = RandomFloat(0.0f, 1.0f);
            float attemptY = RandomFloat(0.0f, 1.0f);
            float minDistance = MinimumDistance2D(samples, sampleIndex, attemptX, attemptY);

            if (minDistance > bestMinDistance)
            {
                bestX = attemptX;
                bestY = attemptY;
                bestMinDistance = minDistance;
            }
        }
        samples[sampleIndex][0] = bestX;
        samples[sampleIndex][1] = bestY;
    }

    // save bitmap etc
    Test2D("2DPoisson.bmp", samples);
}

//======================================================================================
int main (int argc, char **argv)
{
    // 1D tests
    {
        TestUniform1D(false);
        TestUniform1D(true);

        TestUniformRandom1D();

        TestSubRandomA1D(2);
        TestSubRandomA1D(4);
        TestSubRandomA1D(8);
        TestSubRandomA1D(16);
        TestSubRandomA1D(32);

        TestSubRandomB1D();

        TestVanDerCorput(2);
        TestVanDerCorput(3);
        TestVanDerCorput(4);
        TestVanDerCorput(5);

        // golden ratio mod 1 aka (sqrt(5) - 1)/2
        TestIrrational1D(0.618034f, 0.0f);
        TestIrrational1D(0.618034f, 0.385180f);
        TestIrrational1D(0.618034f, 0.775719f);
        TestIrrational1D(0.618034f, 0.287194f);

        // sqrt(2) - 1
        TestIrrational1D(0.414214f, 0.0f);
        TestIrrational1D(0.414214f, 0.385180f);
        TestIrrational1D(0.414214f, 0.775719f);
        TestIrrational1D(0.414214f, 0.287194f);

        // PI mod 1
        TestIrrational1D(0.141593f, 0.0f);
        TestIrrational1D(0.141593f, 0.385180f);
        TestIrrational1D(0.141593f, 0.775719f);
        TestIrrational1D(0.141593f, 0.287194f);
        
        TestSobol1D();

        TestHammersley1D(0);
        TestHammersley1D(1);
        TestHammersley1D(2);

        TestPoisson1D();
    }

    // 2D tests
    {
        TestUniform2D(false);
        TestUniform2D(true);

        TestUniformRandom2D();

        TestSubRandomA2D(2, 2);
        TestSubRandomA2D(2, 3);
        TestSubRandomA2D(3, 11);
        TestSubRandomA2D(3, 97);

        TestSubRandomB2D();

        TestHalton(2, 3);
        TestHalton(5, 7);
        TestHalton(13, 9);

        TestSobol2D();

        TestHammersley2D(0);
        TestHammersley2D(1);
        TestHammersley2D(2);

        TestRooks2D();

        // X axis = golden ratio mod 1 aka (sqrt(5)-1)/2
        // Y axis = sqrt(2) mod 1
        TestIrrational2D(0.618034f, 0.414214f, 0.0f, 0.0f);
        TestIrrational2D(0.618034f, 0.414214f, 0.775719f, 0.264045f);

        // X axis = sqrt(2) mod 1
        // Y axis = sqrt(3) mod 1
        TestIrrational2D(std::fmodf((float)std::sqrt(2.0f), 1.0f), std::fmodf((float)std::sqrt(3.0f), 1.0f), 0.0f, 0.0f);
        TestIrrational2D(std::fmodf((float)std::sqrt(2.0f), 1.0f), std::fmodf((float)std::sqrt(3.0f), 1.0f), 0.775719f, 0.264045f);

        TestPoisson2D();
    }

    #if CALCULATE_DISCREPANCY
        printf("\n");
        system("pause");
    #endif
}

Incremental Least Squares Curve Fitting

This Post In Short:

Fit a curve of degree N to a data set, getting data points 1 at a time.
Storage Required: 3*N+2 values.
Update Complexity: roughly 3*N+2 additions and multiplies.
Finalize Complexity: Solving Ax=b where A is an (N+1)x(N+1) matrix and b is a known vector. (Sample code inverts A matrix and multiplies by b, Gaussian elimination is better though).
Simple C++ code and HTML5 demo at bottom!

I was recently reading a post from a buddy on OIT or “Order Independent Transparency” which is an open problem in graphics:
Fourier series based OIT and why it won’t work

In the article he talks about trying to approximate a function per pixel and shows the details of some methods he tried. One of the difficulties with the problem is that during a render you can get any number of triangles affecting a specific pixel, but you need a fixed and bounded size amount of storage per pixel for those variable numbers of data points.

That made me wonder: Is there an algorithm that can approximate a data set with a function, getting only one data point at a time, and end up with a decent approximation?

It turns out that there is one, at least one that I am happy with: Incremental Least Squares Curve Fitting.

While this perhaps doesn’t address all the problems that need addressing for OIT specifically, I think this is a great technique for programming in general, and I’m betting it still has it’s uses in graphics, for other times when you want to approximate a data set per pixel.

We’ll work through a math oriented way to do it, and then we’ll convert it into an equivalent and simpler programmer friendly version.

At the bottom of the post is some simple C++ that implements everything we talk about and the image below is a screenshot of an an interactive HTML5 demo I made: Least Squares Curve Fitting

Mathy Version

I found out about this technique by asking on math stack exchange and got a great (if not mathematically dense!) answer:
Math Stack Exchange: Creating a function incrementally

I have to admit, I’m not so great with matrices outside of the typical graphics/gamedev usage cases of transormation and related, so it took me a few days to work through it and understand it all. If reading that answer made your eyes go blurry, give my explanation a shot. I’m hoping I gave step by step details enough such that you too can understand what the heck he was talking about. If not, let me know where you got lost and I can explain better and update the post.

The first thing we need to do is figure out what degree of a function we want to approximate our data with. For our example we’ll pick a degree 2 function, also known as a quadratic function. That means that when we are done we will get out a function of the form below:

$y=ax^2+bx+c$

We will give data points to the equation and it will calculate the values of a,b and c that approximate our function by minimizing the sum of the squared distance from each point to the curve.

We’ll deal with regular least squared fitting before moving onto incremental, so here’s the data set we’ll be fitting our quadratic curve to:

$(1,5),(2,16),(3,31),(4,50)$

The x values in my data set start at 1 and count up by 1, but that is not a requirement. You can use whatever x and y values you want to fit a curve to.

Next we need to calculate the matrix $A$ , where $A_{jk} = x_j^k$ and the matrix has NumDataPoints rows and Degree+1 columns. It looks like the below for a quadratic curve fitting 4 data points:

$A = \begin{bmatrix} x_0^0 & x_0^1 & x_0^2 \\ x_1^0 & x_1^1 & x_1^2 \\ x_2^0 & x_2^1 & x_2^2 \\ x_3^0 & x_3^1 & x_3^2 \\ \end{bmatrix}$

When we plug in our specific x values we get this:

$A = \begin{bmatrix} 1^0 & 1^1 & 1^2 \\ 2^0 & 2^1 & 2^2 \\ 3^0 & 3^1 & 3^2 \\ 4^0 & 4^1 & 4^2 \\ \end{bmatrix}$

Calculating it out we get this:

$A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ \end{bmatrix}$

Next we need to calculate the matrix $A^TA$ , which we do below by multiplying the transpose of A by A:

$A^TA = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 1 & 4 & 9 & 16 \\ \end{bmatrix} * \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ \end{bmatrix} = \begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

Next we need to find the inverse of that matrix to get $(A^TA)^{-1}$ . The inverse is:

$(A^TA)^{-1} = \begin{bmatrix} 31/4 & -27/4 & 5/4 \\ -27/4 & 129/20 & -5/4 \\ 5/4 & -5/4 & 1/4 \\ \end{bmatrix}$

The next thing we need to calculate is $A^TY$ , which is the transpose of A multiplied by all of the Y values of our data:

$A^TY = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 1 & 4 & 9 & 16 \\ \end{bmatrix} * \begin{bmatrix} 5 \\ 16 \\ 31 \\ 50 \\ \end{bmatrix} = \begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

And finally, to calculate the coefficients of our quadratic function, we need to calculate $(A^TA)^{-1}*A^TY$ :

$(A^TA)^{-1}*A^TY = \begin{bmatrix} 31/4 & -27/4 & 5/4 \\ -27/4 & 129/20 & -5/4 \\ 5/4 & -5/4 & 1/4 \\ \end{bmatrix} * \begin{bmatrix} 102 \\ 330 \\ 1148 \\ \end{bmatrix} = \begin{bmatrix} -2 & 5 & 2 \\ \end{bmatrix}$

Those coefficients are listed in power order of x, so the first value -2 is the coefficient for x^0, 5 is the coefficient for x^1 and 2 is the coefficient for x^2. That gives us the equation:

$y=2x^2+5x-2$

If you plug in the x values from our data set, you’ll find that this curve perfectly fits all 4 data points.

It won’t always be (and usually won’t be) that a resulting curve matches the input set for all values. It just so happened that this time it does. The only guarantee you’ll get when fitting a curve to the data points is that the squared distance of the point to the curve (distance on the Y axis only, so vertical distance), is minimized for all data points.

Now that we’ve worked through the math, let’s make some observations and make it more programmer friendly.

Making it Programmer Friendly

Let’s look at the $A^TA$ matrix again:

$\begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

One thing you probably noticed right away is that it’s symmetric across the diagonal. Another thing you may have noticed is that there are only 5 unique values in that matrix.

As it turns out, those 5 values are just the sum of the x values, when those x values are raised to increasing powers.

If you take all x values of our data set, raise them to the 0th power and sum the results, you get 4.
If you take all x values of our data set, raise them to the 1st power and sum the results, you get 10.
If you take all x values of our data set, raise them to the 2nd power and sum the results, you get 30.
If you take all x values of our data set, raise them to the 3rd power and sum the results, you get 100.
If you take all x values of our data set, raise them to the 4th power and sum the results, you get 354.

Further more, the power of the x values in each part of the matrix is the zero based x axis index plus the zero based y axis index. Check out what i mean below, which shows which power the x values are taken to before being summed for each location in the matrix:

$\begin{bmatrix} 0 & 1 & 2 \\ 1 & 2 & 3 \\ 2 & 3 & 4 \\ \end{bmatrix}$

That is interesting for two reasons…

This tells us that we only really need to store the 5 unique values, and that we can reconstruct the full matrix later when it’s time to calculate the coefficients.
It also tells us that if we’ve fit a curve to some data points, but then want to add a new data point, that we can just raise the x value of our new data point to the different powers and add it into these 5 values we already have stored. In other words, the $A^TA$ matrix can be incrementally adjusted as new data comes in.

This generalizes beyond quadratic functions too luckily. If you are fitting your data points with a degree N curve, the $A^TA$ matrix will have N+1 rows, and N+1 columns, but will only have (N+1)*2-1 unique values stored in it. Those values will be the sum of the x values taken from the 0th power up to the (N+1)*2-2th power.

As a concrete example, a cubic fit will have an $A^TA$ array that is 4×4, which will only have 7 unique values stored in it. Those values will be the x values raised to the 0th power and summed, all the way up to the x values raised to the 6th power and summed.

So, the $A^TA$ matrix has a fixed storage amount of (degree+1)*2 – 1 values, and it can be incrementally updated.

That is great, but there is another value we need to look at too, which is the $A^TY$ vector. Let’s see that again:

$\begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

There are some patterns to this vector too luckily. You may have noticed that the first entry is the sum of the Y values from our data set. It’s actually the sum of the y values multiplied by the x values raised to the 0th power.

The next number is the sum of the y values multiplied by the x values raised to the 1st power, and so on.

To generalize it, each entry in that vector is the sum of taking the x from each data point, raising it to the power that is the index in the vector, and multiplying it by the y value.

Taking each data point’s x value, raising it to the 0th power, multiplying by the y value, and summing the results gives you 102.
Taking each data point’s x value, raising it to the 1st power, multiplying by the y value, and summing the results gives you 330.
Taking each data point’s x value, raising it to the 2nd power, multiplying by the y value, and summing the results gives you 1148.

So, this vector is incrementally updatable too. When you get a new data point, for each entry in the vector, you take the x value to the specific power, multiply by y, and add that result to the entry in the vector.

This generalizes for other curve types as well. If you are fitting your data points with a degree N curve, the $A^TY$ vector will have N+1 entries, corresponding to the powers: 0,1,…N.

As a concrete example, a cubic fit will have an $A^TY$ vector of size 4, corresponding to the powers: 0,1,2,3.

Combining the storage needs of the values needed for the $A^TA$ matrix, as well as the values needed for the $A^TY$ vector, the amount of storage we need for a degree N curve fit is 3*N+2 values.

Algorithm Summarized

Here is a summary of the algorithm:

First decide on the degree of the fit you want. Call it N.
Ensure you have storage space for 3*N+2 values and initialize them all to zero. These represent the (N+1)*2-1 values needed for the $A^TA$ matrix values, as well as the N+1 values needed for the $A^TY$ vector.
For each data point you get, you will need to update both the matrix values, as well as the vector valuess. (Note that there are not the same number of values in ATA and ATY!)
- for(i in ATA) ATA[i] += x^i
- for(i in ATY) ATY[i] += x^i*y
When it’s time to calculate the coefficients of your polynomial, convert the ATA values back into the $A^TA$ matrix, invert it and multiply that by the $A^TY$ value.

Pretty simple right?

Not Having Enough Points

When working through the mathier version of this algorithm, you may have noticed that if we were trying to fit using a degree N curve, that we needed N+1 data points at minimum for the math to even be able to happen.

So, you might ask, what happens in the real world, such as in a pixel shader, where we decide to do a cubic fit, but end up only getting 1 data point, instead of the 4 minimum that we need?

Well, first off, if you use the programmer friendly method of incrementally updating ATA and ATY, you’ll end up with an uninvertible matrix (0 determinant), but that doesn’t really help us any besides telling us when we don’t have enough data.

There is something pretty awesome hiding here though. Let’s look at the ATA matrix and ATY values from our quadratic example again.

$A^TA = \begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

$A^TY = \begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

The above values are for a quadratic fit. What if we wanted a linear fit instead? Well… the upper left 2×2 matrix in ATA is the ATA matrix for the linear fit! Also, the first two values in the ATY vector is the ATY vector if we were doing a linear fit.

$A^TA = \begin{bmatrix} 4 & 10 \\ 10 & 30 \\ \end{bmatrix}$

$A^TY = \begin{bmatrix} 102 & 330 \\ \end{bmatrix}$

You can verify that the linear fit above is correct if you want, but let’s take it down another degree, down to approximating the fit with a point. They become scalars instead of matrices and vectors:

$A^TA = 4 \\ A^TY = 102$

If we take the inverse of ATA and multiply it by ATY, we get:

$1/4 * 102 = 25.5$

if you average the Y values of our input data, you’ll find that it is indeed 25.5, so we have verified that it does give us a degree 0 fit.

This is neat and all, but how can we KNOW if we’ve collected enough data or not? Do we just try to invert our ATA matrix, and if it fails, try one degree lower, repeatedly, until we succeed or fail at a degree 0 approximation? Do we maybe instead store a counter to keep track of how many points we have seen?

Luckily no, and maybe you have already put it together. The first value in the ATA array actually TELLS you how many points you have been given. You can use that to decide what degree you are going to have to actually fit the data set to when it’s time to calculate your coefficients, to avoid the uninvertible matrix and still get your data fit.

Interesting Tid Bits

Something pretty awesome about this algorithm is that it can work in a multithreaded fashion very easily. One way would be to break apart the work into multiple job threads, have them calculate ATA and ATY independently, and then sum them all together on the main thread. Another way to do it would be to let all threads share the same ATA and ATY storage, but to use an atomic add operation to update them.

Going the atomic add route, I think this could be a relatively GPU friendly algorithm. You could use actual atomic operations in your shader code, or you could use alpha blending to add your result in.

Even though we saw it in the last section, I’ll mention it again. If you do a degree 0 curve fit to data points (aka fitting a point to the data), this algorithm is mathematically equivalent to just taking the average y value. The ATA values will have a single value which is the sum of the x values to the 0th degree, so will be the count of how many x items there are. The ATY values will also have only a single value, which will be the sum of the x^0*y values, so will be the sum of the y values. Taking the inverse of our 1×1 ATA matrix will give us one divided by how many items there are, so when we multiply that by the ATA vector which only has one item, it will be the same as if we divided our Y value sum by how many data points we had. So, in a way, this algorithm seems to be some sort of generalization of averaging, which is weird to me.

Another cool thing: if you have the minimum number of data points for your degree (aka degree+1 data points) or fewer, you can actually use the ATA and ATY values to get back your ORIGINAL data points – both the X and the Y values! I’ll leave it as an exercise for you, but if you look at it, you will always have more equations than you do unknowns.

If reconstructing the original data points is important to you, you could also have this algorithm operate in two modes.

Basically, always use the ATA[0] storage space to count the number of data points you’ve been given, since that is it’s entire purpose. You can then use the rest of the storage space as RAW data storage for your 2d input values. As soon as adding another value would cause you to overflow your storage, you could process your data points into the correct format of storing just ATA and ATY values, so that from then on, it was an approximation, instead of explicit point storage.

When decoding those values, you would use the ATA[0] storage space to know whether the rest of the storage contained ATA and ATY values, or if they contained data points. If they contained data points, you would also know how many there were, and where they were in the storage space, using the same logic to read data points out as you used to put them back in – basically like saying that the first data point goes immediately after ATA[0], the second data point after that, etc.

The last neat thing, let’s say that you are in a pixel shader as an exmaple, and that you wanted to approximate 2 different values for each pixel, but let’s say that the X value was always the same for these data sets – maybe you are approximating two different values over depth of the pixel for instance so X of both data points is the depth, but the Y values of the data points are different.

If you find yourself in a situation like this, you don’t actually need to pay the full cost of storage to have a second approximation curve.

Since the ATA values are all based on powers of the x values only, the ATA values would be the same for both of these data sets. You only need to pay the cost of the ATY values for the second curve.

This means that fitting a curve costs an initial 3*degree+2 in storage, but each additional curve only costs degree+1 in storage.

Also, since the ATA storage for a curve of degree N also contains the same values used for a curve of degree N-1, N-2, etc, you don’t have to use the same degree when approximating multiple values using the same storage. Your ATA just has to be large enough to hold the largest degree curve, and then you can have ATY values that are sized to the degree of the curve you want to use to approximate each data set.

This way, if you have limited storage, you could perhaps cubically fit one data set, and then linearly fit another data set where accuracy isn’t as important.

For that example, you would pay 11 values of storage for the cubic fit, and then only 2 more values of storage to have a linear fit of some other data.

Example Code

There is some example code below that implements the ideas from this post.

The code is meant to be clear and readable firstly, with being a reasonably decent implementation second. If you are using this in a place where you want high precision and/or high speeds, there are likely both macro and micro optimizations and code changes to be made. The biggest of these is probably how the matrix is inverted.

You can read more on the reddit discussion: Reddit: Incremental Least Squares Curve Fitting

Here’s a run of the example code:

Here is the example code:

#include <stdio.h>
#include <array>

//====================================================================
template<size_t N>
using TVector = std::array<float, N>;

template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

template<size_t N>
using TSquareMatrix = TMatrix<N,N>;

typedef TVector<2> TDataPoint; 

//====================================================================
template <size_t N>
float DotProduct (const TVector<N>& A, const TVector<N>& B)
{
    float ret = 0.0f;
    for (size_t i = 0; i < N; ++i)
        ret += A[i] * B[i];
    return ret;
}

//====================================================================
template <size_t M, size_t N>
void TransposeMatrix (const TMatrix<M, N>& in, TMatrix<N, M>& result)
{
    for (size_t j = 0; j < M; ++j)
        for (size_t k = 0; k < N; ++k)
            result[k][j] = in[j][k];
}

//====================================================================
template <size_t M, size_t N>
void MinorMatrix (const TMatrix<M, N>& in, TMatrix<M-1, N-1>& out, size_t excludeI, size_t excludeJ)
{
    size_t destI = 0;
    for (size_t i = 0; i < N; ++i)
    {
        if (i != excludeI)
        {
            size_t destJ = 0;
            for (size_t j = 0; j < N; ++j)
            {
                if (j != excludeJ)
                {
                    out[destI][destJ] = in[i][j];
                    ++destJ;
                }
            }
            ++destI;
        }
    }
}

//====================================================================
template <size_t M, size_t N>
float Determinant (const TMatrix<M,N>& in)
{
    float determinant = 0.0f;
	TMatrix<M - 1, N - 1> minor;
    for (size_t j = 0; j < N; ++j)
    {
        MinorMatrix(in, minor, 0, j);

        float minorDeterminant = Determinant(minor);
        if (j % 2 == 1)
            minorDeterminant *= -1.0f;

        determinant += in[0][j] * minorDeterminant;
    }
    return determinant;
}

//====================================================================
template <>
float Determinant<1> (const TMatrix<1,1>& in)
{
	return in[0][0];
}

//====================================================================
template <size_t N>
bool InvertMatrix (const TSquareMatrix<N>& in, TSquareMatrix<N>& out)
{
    // calculate the cofactor matrix and determinant
    float determinant = 0.0f;
    TSquareMatrix<N> cofactors;
    TSquareMatrix<N-1> minor;
    for (size_t i = 0; i < N; ++i)
    {
        for (size_t j = 0; j < N; ++j)
        {
            MinorMatrix(in, minor, i, j);

            cofactors[i][j] = Determinant(minor);
            if ((i + j) % 2 == 1)
                cofactors[i][j] *= -1.0f;

            if (i == 0)
                determinant += in[i][j] * cofactors[i][j];
        }
    }

    // matrix cant be inverted if determinant is zero
    if (determinant == 0.0f)
        return false;

    // calculate the adjoint matrix into the out matrix
    TransposeMatrix(cofactors, out);

    // divide by determinant
    float oneOverDeterminant = 1.0f / determinant;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            out[i][j] *= oneOverDeterminant;
    return true;
}

//====================================================================
template <>
bool InvertMatrix<2> (const TSquareMatrix<2>& in, TSquareMatrix<2>& out)
{
    float determinant = Determinant(in);
    if (determinant == 0.0f)
        return false;

    float oneOverDeterminant = 1.0f / determinant;
    out[0][0] =  in[1][1] * oneOverDeterminant;
    out[0][1] = -in[0][1] * oneOverDeterminant;
    out[1][0] = -in[1][0] * oneOverDeterminant;
    out[1][1] =  in[0][0] * oneOverDeterminant;
    return true;
}

//====================================================================
template <size_t DEGREE>  // 1 = linear, 2 = quadratic, etc
class COnlineLeastSquaresFitter
{
public:
    COnlineLeastSquaresFitter ()
    {
        // initialize our sums to zero
        std::fill(m_SummedPowersX.begin(), m_SummedPowersX.end(), 0.0f);
        std::fill(m_SummedPowersXTimesY.begin(), m_SummedPowersXTimesY.end(), 0.0f);
    }

    void AddDataPoint (const TDataPoint& dataPoint)
    {
        // add the summed powers of the x value
        float xpow = 1.0f;
        for (size_t i = 0; i < m_SummedPowersX.size(); ++i)
        {
            m_SummedPowersX[i] += xpow;
            xpow *= dataPoint[0];
        }

        // add the summed powers of the x value, multiplied by the y value
        xpow = 1.0f;
        for (size_t i = 0; i < m_SummedPowersXTimesY.size(); ++i)
        {
            m_SummedPowersXTimesY[i] += xpow * dataPoint[1];
            xpow *= dataPoint[0];
        }
    }

    bool CalculateCoefficients (TVector<DEGREE+1>& coefficients) const
    {
        // initialize all coefficients to zero
        std::fill(coefficients.begin(), coefficients.end(), 0.0f);

        // calculate the coefficients
        return CalculateCoefficientsInternal<DEGREE>(coefficients);
    }

private:

    template <size_t EFFECTIVEDEGREE>
    bool CalculateCoefficientsInternal (TVector<DEGREE + 1>& coefficients) const
    {
        // if we don't have enough data points for this degree, try one degree less
        if (m_SummedPowersX[0] <= EFFECTIVEDEGREE)
            return CalculateCoefficientsInternal<EFFECTIVEDEGREE - 1>(coefficients);

        // Make the ATA matrix
        TMatrix<EFFECTIVEDEGREE + 1, EFFECTIVEDEGREE + 1> ATA;
        for (size_t i = 0; i < EFFECTIVEDEGREE + 1; ++i)
            for (size_t j = 0; j < EFFECTIVEDEGREE + 1; ++j)
                ATA[i][j] = m_SummedPowersX[i + j];

        // calculate inverse of ATA matrix
        TMatrix<EFFECTIVEDEGREE + 1, EFFECTIVEDEGREE + 1> ATAInverse;
        if (!InvertMatrix(ATA, ATAInverse))
            return false;

        // calculate the coefficients for this degree. The higher ones are already zeroed out.
        TVector<EFFECTIVEDEGREE + 1> summedPowersXTimesY;
        std::copy(m_SummedPowersXTimesY.begin(), m_SummedPowersXTimesY.begin() + EFFECTIVEDEGREE + 1, summedPowersXTimesY.begin());
        for (size_t i = 0; i < EFFECTIVEDEGREE + 1; ++i)
            coefficients[i] = DotProduct(ATAInverse[i], summedPowersXTimesY);
        return true;
    }

    // Base case when no points are given, or if you are fitting a degree 0 curve to the data set.
    template <>
    bool CalculateCoefficientsInternal<0> (TVector<DEGREE + 1>& coefficients) const
    {
        if (m_SummedPowersX[0] > 0.0f)
            coefficients[0] = m_SummedPowersXTimesY[0] / m_SummedPowersX[0];
        return true;
    }

    // Total storage space (# of floats) needed is 3 * DEGREE + 2
    // Where y is number of values that need to be stored and x is the degree of the polynomial
    TVector<(DEGREE + 1) * 2 - 1>   m_SummedPowersX;
    TVector<DEGREE + 1>             m_SummedPowersXTimesY;
};

//====================================================================
template <size_t DEGREE>
void DoTest(const std::initializer_list<TDataPoint>& data)
{
	printf("Fitting a curve of degree %zi to %zi data points:n", DEGREE, data.size());

    COnlineLeastSquaresFitter<DEGREE> fitter;

	// show data
    for (const TDataPoint& dataPoint : data)
		printf("  (%0.2f, %0.2f)n", dataPoint[0], dataPoint[1]);

	// fit data
    for (const TDataPoint& dataPoint : data)
        fitter.AddDataPoint(dataPoint);

	// calculate coefficients if we can
	TVector<DEGREE+1> coefficients;
	bool success = fitter.CalculateCoefficients(coefficients);
	if (!success)
	{
		printf("ATA Matrix could not be inverted!n");
		return;
	}

	// print the polynomial
	bool firstTerm = true;
	printf("y = ");
    bool showedATerm = false;
	for (int i = (int)coefficients.size() - 1; i >= 0; --i)
	{
		// don't show zero terms
		if (std::abs(coefficients[i]) < 0.00001f)
			continue;

        showedATerm = true;

		// show an add or subtract between terms
		float coefficient = coefficients[i];
		if (firstTerm)
			firstTerm = false;
		else if (coefficient >= 0.0f)
			printf(" + ");
		else
		{
			coefficient *= -1.0f;
			printf(" - ");
		}

		printf("%0.2f", coefficient);

		if (i > 0)
			printf("x");

		if (i > 1)
			printf("^%i", i);
	}
    if (!showedATerm)
        printf("0");
	printf("nn");
}

//====================================================================
int main (int argc, char **argv)
{
	// Point - 1 data points
    DoTest<0>(
        {
            TDataPoint{ 1.0f, 2.0f },
        }
    );

	// Point - 2 data points
    DoTest<0>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
        }
    );

	// Linear - 2 data points
    DoTest<1>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
        }
    );

	// Linear - 3 colinear data points
    DoTest<1>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
            TDataPoint{ 3.0f, 6.0f },
        }
    );

	// Linear - 3 non colinear data points
    DoTest<1>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
            TDataPoint{ 3.0f, 5.0f },
        }
    );

	// Quadratic - 3 colinear data points
    DoTest<2>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
            TDataPoint{ 3.0f, 6.0f },
        }
    );

	// Quadratic - 3 data points
    DoTest<2>(
        {
            TDataPoint{ 1.0f, 5.0f },
            TDataPoint{ 2.0f, 16.0f },
            TDataPoint{ 3.0f, 31.0f },
        }
    );

	// Cubic - 4 data points
    DoTest<3>(
        {
            TDataPoint{ 1.0f, 5.0f },
            TDataPoint{ 2.0f, 16.0f },
            TDataPoint{ 3.0f, 31.0f },
            TDataPoint{ 4.0f, 16.0f },
        }
    );

	// Cubic - 2 data points
    DoTest<3>(
        {
            TDataPoint{ 1.0f, 7.0f },
            TDataPoint{ 3.0f, 17.0f },
        }
    );

	// Cubic - 1 data point
    DoTest<3>(
        {
            TDataPoint{ 1.0f, 7.0f },
        }
    );

	// Cubic - 0 data points
    DoTest<3>(
        {
        }
    );

    system("pause");
    return 0;
}

Feedback

There’s some interesting feedback on twitter.

Links

Here’s an interactive demo to let you get a feel for how least squares curve fitting behaves:
Least Squares Curve Fitting

Wolfram Mathworld: Least Squares Fitting
Wikipedia: Least Squares Fitting
Inverting a 2×2 Matrix
Inverting Larger Matrices

A good online polynomial curve fitting calculator

By the way, the term for an algorithm which works incrementally by taking only some of the data at a time is called an “online algorithm”. If you are ever in search of an online algorithm to do X (whatever X may be), using this term can be very helpful when searching for such an algorithm, or when asking people if such an algorithm exists (like on stack exchange). Unfortunately, online is a bit overloaded in the modern world, so it can also give false hits (;
Wikipedia: Online algorithm

Evaluating Polynomials with the GPU Texture Sampler

This is an extension of a paper I wrote which shows how to use the linear texture sampling capabilities of the GPU to calculate points on Bezier curves. You store the control points in the texture, then sample along the texture’s diagonal to get points on the curve:
GPU Texture Sampler Bezier Curve Evaluation

I’ve been thinking about the items in the “future work” section and found some interesting things regarding polynomials, logic gates, surfaces and volumes. This is the first post, which deals with evaluating polynomials.

Evaluating Polynomials

One of the main points of my paper was that N-linear interpolation (linear, bilinear, trilinear, etc) can be used to evaluate the De Casteljau algorithm since both things are just linear interpolations of linear interpolations. (Details on bilinear interpolation here: Bilinear Filtering & Bilinear Interpolation).

This meant that it was also able to calculate Bernstein Polynomials (aka the algebraic form of Bezier curves), since Bernstein polynomials are equivalent to the De Casteljau algorithm.

I started looking around to see what would happen if you messed around with the De Casteljau algorithm a bit, like interpolate at one level by
$t^2$ or $t*0.5+0.5$ or by a constant or by another variable completely. My hope was that I’d be able to make the technique more generic and open it up to a larger family of equations, so people weren’t limited to just Bernstein polynomials.

That opened up a pretty deep rabbit hole on polynomial blossoming and something called Symmetric Multiaffine Functions. There are some great links in the answer here:
Math Stack Exchange: Modifying and Generalizing the De Casteljau Algorithm

In the end, it turned out to be pretty simple though. It turns out that any polynomial can be converted back and forth from “Power Basis” (which looks like $Ax^2+Bx+C$ ) to “Bernstein Basis” (which looks like $A(1-t)^2+B(1-t)t+Ct^2$ ) so long as they are the same degree.

This isn’t the result I was expecting but it is a nice result because it’s simple. I think there is more to be explored by sampling off the diagonal, and using different t values at different stages of interpolation, but this result is worth sharing.

By the way, you could also use curve fitting to try and approximate a higher degree function with a lower degree one, but for this post, I’m only going to be talking about exact conversion from Bernstein polynomials to Power polynomials.

Since we can convert power basis polynomials to Bernstein polynomials, and the technique already works for Bernstein polynomials, that means that if we have some random polynomial, say $y=2x^3+4x+2$ , that we can make this technique work for that too. The technique got a little closer to arbitrary equation evaluation. Neat!

Converting Power Basis to Bernstein Basis

I found the details of the conversion process at Polynomial Evaluation and Basis Conversion which was linked to by Math Stack Exchange: Convert polynomial curve to Bezier Curve control points.

This is best explained working through examples, so let’s start by converting a quadratic polynomial from power basis to Bernstein basis.

Quadratic Function

$y=2x^2+8x+3$

The first thing we do is write the coefficients vertically, starting with the $x^0$ coefficient, then the $x^1$ coefficient and continuing on to the highest value $x^n$ :

$\begin{array}{c} 3 \\ 8 \\ 2 \\ \end{array}$

Next, we need to divide by the Binomial Coefficients (aka the row of Pascal’s Triangle which has the same number of items as we have coefficients). In this case we need to divide by: 1,2,1.

$\begin{array}{c|c} 3 & 3 / 1 = 3 \\ 8 & 8 / 2 = 4 \\ 2 & 2 / 1 = 2 \\ \end{array}$

Now we generate a difference table backwards. it’s hard to explain what that is in words, but if you notice, each value is the sum of the value to the left of it, and the one below that.

$\begin{array}{c|c|c|c} 3 & 3 / 1 = 3 & 7 & 13 \\ 8 & 8 / 2 = 4 & 6 & \\ 2 & 2 / 1 = 2 & & \\ \end{array}$

We are all done. The control points for the Bezier curve are on the top row (ignoring the left most column). They are 3,7,13 which makes it so we have the following two equations being equal. The first is in power basis, the second is in Bernstein basis.

$y=2x^2+8x+3$
$y=3(1-x)^2+14(1-x)x+13x^2$

Note: don’t forget that Bezier curves multiply the control points by the appropriate row in Pascal’s triangle. That’s where the 14 comes from in the middle term of the Bernstein polynomial. We are multiplying the control points 3,7,13 by the row in Pascal’s triangle 1,2,1 to get the final coefficients of 3,14,13.

Let’s have Wolfram Alpha help us verify that they are equal.

Wolfram Alpha: graph y=2x^2+8x+3, y=3*(1-x)^2+14x*(1-x)+13x^2, from 0 to 1

Yep, they are equal! If you notice the legend of the graph, wolfram actually converted the Bernstein form back to power basis, and you can see that they are exactly equivalent.

You can also write the Bernstein form like the below, which i prefer, using $t$ instead of $x$ and also setting $s=1-t$ .

$y=3s^2+14st+13t^2$

Cubic Function

A cubic function is not that much harder than a quadratic function. After this, you should see the pattern and be able to convert any degree easily.

$y=5x^3+9x-4$

Again, the first thing we do is write the coefficients vertically, starting with the constant term. Note that we don’t have an $x^2$ term, so it’s coefficient is 0.

$\begin{array}{c} -4 \\ 9 \\ 0 \\ 5 \\ \end{array}$

We next divide by the Pascal’s triangle row 1,3,3,1.

$\begin{array}{c|c} -4 & -4 / 1 = -4 \\ 9 & 9 / 3 = 3 \\ 0 & 0 / 3 = 0 \\ 5 & 5 / 1 = 5 \\ \end{array}$

Now, make the difference table going backwards again:

$\begin{array}{c|c|c|c|c} -4 & -4 / 1 = -4 & -1 & 2 & 10 \\ 9 & 9 / 3 = 3 & 3 & 8 & \\ 0 & 0 / 3 = 0 & 5 & & \\ 5 & 5 / 1 = 5 & & & \\ \end{array}$

Our Bezier control points are along the top: -4,-1,2,10. Keeping in mind that the coefficients for a cubic bezier curve are multiplied by 1,3,3,1 we can make the Bernstein form and put it next to our original formula:

$y=5x^3+9x-4$
$y=-4(1-x)^3-3(1-x)^2x+6(1-x)x^2+10x^3$

Let’s check in wolfram alpha again:
Wolfram Alpha: graph y=5x^3+9x-4, y=-4(1-x)^3-3x(1-x)^2+6x^2(1-x)+10x^3, from 0 to 1

And here it is in the cleaner form:

$y=-4s^3-3s^2t+6st^2+10t^3$

Some Notes On Calculating Polynomials with the Texture Sampler

You may notice that in the comparison graphs i only plotted the graphs from 0 to 1 on the x axis (aka the t axis). The equations are actually equivalent outside of that range as well, but the technique from my paper only works from the 0 to 1 range because it relies on built in hardware pixel interpolation. This may sound like a big limitation, but if you know the minimum and maximum value of x that you want to plug into your equation at runtime, you can convert your x into a percent between those values, get the resulting polynomial, convert it to Bernstein form, set up the texture, and then at runtime convert your input parameter into that percent when you do the lookup. In other words, you squeeze the parts of the function you care about into the 0 to 1 range.

Another issue you will probably hit is that standard RGBA8 textures have only 8 bits per channel and can only store values between 0 and 1. Since the texture is supposed to be storing your control points, that is bad news.

One way to get around this is to find the largest coefficient value and divide the others by this value. This will put the coefficients into the 0 to 1 range, which will be able to be stored in your texture. After sampling the texture, you multiply the result by that scaling value to get the correct answer.

Scaling won’t help having both negative and positive coefficients though. To handle negative coefficients, you could map the 0-1 space to be from -1 to 1, similar to how we often do it with normal maps and other signed data stored in textures. After doing the lookup you’d have to unmap it too of course.

You could also solve negative values and scaling problems by squishing the y axis into the 0 to 1 space by subtracting the minimum and dividing by the maximum minus the minimum, similarly to how we squished the x range into 0 to 1.

If you instead move to an RGBAF32 texture, you’ll have a full 32 bit float per color channel and won’t have problems with either large values or negative values. You will still have to deal with x only going from 0 to 1 though.

I also want to mention that the hardware texture interpolation works in a X.8 fixed point format. There are more details in my paper, but that means that you’ll get some jagged looking artifacts on your curve instead of a smoothly varying value. If that is a problem for you in practice, my paper talks about a few ways to mitigate that issue.

Before moving on, I wanted to mention that it’s easy to support rational polynomials using this method as well. A rational polynomial is when you divide one polynomial by another polynomial, and relates to rational Bezier curves, where you divide one curve by another curve (aka you give weights to control points). Rational curves are more powerful and in fact you can perfectly represent sine and cosine with a quadratic rational polynomial. More info on that in my paper.

To calculate rational polynomials, you just encode the numerator polynomial in one color channel, and the denominator polynomial in another color channel. After you sample the texture and get the result of your calculation, you divide the numerator value by the denominator value. It costs one division in your shader code, but that’s pretty cheap for the power it gives you!

Regarding the texture size requirements to store a polynomial of a specific degree…

Every dimension of the texture, and every color channel in that texture, adds a degree.

However, to get the benefit of the degree increase from the color channel, you need to do a little more math in the shader – check my paper for more details!

So, if you wanted to store a quadratic polynomial in a texture, you would need either a 2d texture with 1 color channel, or you could do it with a 1d texture that had 2 color channels.

If you wanted to store a cubic polynomial in a texture, you could use a 3d texture with 1 color channel, or a 2d texture with two color channels (there would be some waste here) or a 1d texture with three color channels.

For a polynomial that had a maximum degree term of 6, you could use a 3d volume texture that had 3 color channels: RGB.

If you need to evaluate a very high degree polynomial, you can actually take multiple texture samples and combine them.

For instance, if you had a 2d texture with a single color channel, you could do a single texture read to get a quadratic.

If you did two texture reads, you would have two quadratics.

If you linearly interpolated between those two quadratics, you would end up with a cubic.

That isn’t a very high degree curve but is easier to grasp how they combine.

Taking this up to RGBA 3d volume textures, a single texture read will get you a curve of degree 6. If you do another read, it will take it to degree 7. Another read gets you to 8, another to 9, etc.

With support for 4d textures, an RGBA texture read would give you a degree 7 curve. Another read would boost it to 8, another to 9, another to 10, etc.

Regarding the specific sizes of the textures, in all cases the texture size is “2” on each dimension because we are always just linearly interpolating within a hyper cube of pixel values. You can increase the size of the texture for piecewise curves, check out the paper for more details on that and other options.

Closing

Hopefully you found this useful or interesting!

There may not have been much new information in here for the more math inclined people, but I still think it’s worth while to explicitly show how the technique works for both Bernstein polynomials as well as the more common power basis polynomials.

I still think it would be interesting to look at what happens when you sample off of the diagonal, and also what happens if you use different values at different stages of the interpolation. As an example, instead of just looking up a texture at (t,t) for the (u,v) value to get a quadratic curve point, what if we look up by (t,t^2)? At first blush, it seems like by doing that we may be able to boost a curve to a higher degree, maybe at the cost of some reduced flexibility for the specific equations we can evaluate?

Next up I’ll be writing up some more extensions to the paper involving logic gates, surfaces, and volumes.

Have any feedback, questions or interesting ideas? Let me know!

The Secret to Writing Fast Code / How Fast Code Gets Slow

This is a “soft tech” post. If that isn’t your thing, don’t worry, I’ll be returning to some cool “hard tech” and interesting algorithms after this. I’ve been abusing the heck out of the GPU texture sampler lately, so be on the lookout for some posts on that soon (;

I’m about to show you some of the fastest code there is. It’s faster than the fastest real time raytracer, it’s faster than Duff’s Device.

Heck, despite the fact that it runs on a classical computer, it runs faster than Shor’s Algorithm which uses quantum computing to factor integers so quickly that it breaks modern cryptographic algorithms.

This code also runs faster than Grover’s Algorithm which is another quantum algorithm that can search an unsorted list in O(sqrt(N)).

Even when compiled in debug it runs faster than all of those things.

Are you ready? here it is…

// Some of the fastest code the world has ever seen
int main (int argc, char **argc)
{
    return 0;
}

Yes, the code does nothing and that is precisely why it runs so fast.

The Secret to Writing Fast Code

The secret to writing fast code, no matter what you are writing is simple: Don’t do anything that is too slow.

Follow me on a made up example to see what I’m talking about.

Let’s say you started with a main() function like i showed above and you decided you want to make a real time raytracer that runs on the CPU.

First thing you do is figure out what frame rate you want it to run at, at the desired resolution. From there, you know how many milliseconds you have to render each frame, and now you have a defined budget you need to stay inside of. If you stay in that budget, you’ll consider it a real time raytracer. If you go outside of that budget, it will no longer be real time, and will be a failed program.

You may get camera control working and primary rays intersecting a plane, and find you’ve used 10% of your budget and 90% of the budget remains. So far so good.

Next up you add some spheres and boxes, diffuse and specular shade them with a directional light and a couple point lights. You find that you’ve used 40% of your budget, and 60% remains. We are still looking good.

Next you decide you want to add reflection and refraction, allowing up to 3 ray bounces. You find you are at 80% of your budget and are still looking good. We are still running fast enough to be considered real time.

Now you say to yourself “You know what? I’m going to do 4x super sampling for anti aliasing!”, so you shoot 4 rays out per pixel instead of 1 and average them.

You profile and uh oh! You are at 320% of your budget! Your ray tracer is no longer real time!

What do you do now? Well, hopefully it’s obvious: DON’T DO THAT, IT’S TOO SLOW!

So you revert it and maybe drop in some FXAA as a post processing pass on your render each frame. Now you are at 95% of your budget let’s say.

Now you may want to add another feature, but with only 5% of your budget left you probably don’t have much performance to spare to do it.

So, you implement whatever it is, find that you are at 105% of your budget.

Unlike the 4x super sampling, which was 220% overbudget, this new feature being only 5% over budget isn’t THAT much. At this point you could profile something that already exists (maybe even your new feature) and see if you can improve it’s performance, or if you can find some clever solution that gives you a performance boost, at the cost of things you don’t care about, you can do that to get some performance back. This is a big part of the job as a successful programmer / software engineer – make trade offs where you gain benefits you care about, at the cost of things you do not care about.

At this point, you can also decide if this new feature is more desired than any of the existing features. If it is, and you can cut an old feature you don’t care about anymore, go for it and make the trade.

Rinse and repeat this process with new features and functionality until you have the features you want, that fit within the performance budget you have set.

Follow this recipe and you too will have your very own real time raytracer (BTW related:Making a Ray Traced Snake Game in Shadertoy).

Maintaining a performance budget isn’t magic. It’s basically subtractive synthesis. Carve time away from your performance budget by adding a feature, then optimize or remove features if you are over budget. Rinse and repeat until the sun burns out.

Ok, so if it’s so easy, why do we EVER have performance problems?

How Fast Code Gets Slow

Performance problems come up when we are not paying attention. Sometimes we cause them for ourselves, and sometimes things outside of our control cause them.

The biggest way we cause performance problems for ourselves is by NOT MEASURING.

If you don’t know how your changes affect performance, and performance is something you care about, you are going to have a bad time.

If you care about performance, measure performance regularly! Profile before and after your changes and compare the differences. Have automated tests that profile your software and report the results. Understand how your code behaves in the best and worst case. Watch out for algorithms that sometimes take a lot longer than their average case. Stable algorithms make for stable experiences (and stable frame rates in games). This is because algorithms that have “perf spikes” sometimes line up on the same frame, and you’ll have more erratic frame rate, which makes your game seem much worse than having a stable but lower frame rate.

But, again, performance problems aren’t always the programmers fault. Sometimes things outside of our control change and cause us perf problems.

Like what you might ask?

Well, let’s say that you are tasked with writing some very light database software which keeps track of all employee’s birthdays.

Maybe you use a hash map to store birthdays. The key is the string of the person’s name, and the value is a unix epoch timestamp.

Simple and to the point. Not over-engineered.

Everything runs quickly, your decisions about the engineering choices you made were appropriate and your software runs great.

Now, someone else has a great idea – we have this database software you wrote, what if we use it to keep track of all of our customers and end user birthdays as well?

So, while you are out on vacation, they make this happen. You come back and the “database” software you made is running super slow. There are hundreds of thousands of people stored in the database, and it takes several seconds to look up a single birthday. OUCH!

So hotshot, looks like your code isn’t so fast huh? Actually no, it’s just that your code was used for something other than the original intended usage case. If this was included in the original specs, you would have done something different (and more complex) to handle this need.

This was an exaggerated example, but this sort of thing happens ALL THE TIME.

If you are working on a piece of software, and the software requirements change, it could turn any of your previous good decisions into poor decisions in light of the new realities.

However, you likely don’t have time to go back and re-think and possibly re-work every single thing you had written up to that point. You move onward and upward, a little more heavy hearted.

The target moved, causing your code to rot a bit, and now things are likely in a less than ideal situation. You wouldn’t have planned for the code you have with the info you have now, but it’s the code you do have, and the code you have to stick with for the time being.

Every time that happens, you incur a little more tech debt / code complexity and likely performance problems as well.

You’ll find that things run a little slower than they should, and that you spend more time fighting symptoms with small changes and somewhat arbitrary rules – like telling people not to use name lengths more than 32 characters for maximum performance of your birthday database.

Unfortunately change is part of life, and very much part of software development, and it’s impossible for anyone to fully predict what sort of changes might be coming.

Those changes are often due to business decisions (feedback on product, jockying for a new position in the marketplace, etc), so are ultimately what give us our paychecks and are ultimately good things. Take it from me, who has worked at ~7 companies in 15 years. Companies that don’t change/adapt die.

So, change sucks for our code, but it’s good for our wallets and keeps us employed 😛

Eventually the less than ideal choices of the past affecting the present will reach some threshold where something will have to be done about it. This will likely happen at the point that it’s easier to refactor some code, than to keep fighting the problems it’s creating by being less than ideal, or when something that really NEEDS to happen CAN’T happen without more effort than the refactor would take.

When that happens, the refactor comes in, where you DO get to go back and rethink your decisions, with knowledge of the current realities.

The great thing about the refactor is that you probably have a lot of stuff that your code is doing which it doesn’t really even NEED to be doing.

Culling that dead functionality feels great, and it’s awesome watching your code become simple again. It’s also nice not having to explain why that section of code behaves the way it does (poorly) and the history of it coming to be. “No really, I do know better, but…!!!”

One of the best feelings as a programmer is looking at a complex chunk of code that has been a total pain, pressing the delete key, and getting a little bit closer back to the fastest code in the world:

// Some of the fastest code the world has ever seen
int main (int argc, char **argc)
{
    return 0;
}

PS: Another quality of a successful engineer is being able to constantly improve software as it’s touched. If you are working in an area of code, and you see something ugly that can be fixed quickly and easily, do it while you are there. Since the only constant in software development is change, and change causes code quality to continually degrade, make yourself a force of continual code improvement and help reverse the flow of the code flowing into the trash can.

Engines

In closing, I want to talk about game engines – 3rd party game engines, and re-using an engine from a different project. This also applies to using middleware.

Existing engines are great in that when you and your team know how to use them, you can get things set up very quickly. It lets you hit the ground running.

However, no engine is completely generic. No engine is completely flexible.

That means that when you use an existing engine, there will be some amount of features and functionality which were made without your specific usage case in mind.

You will be stuck in the world where from day 1 you are incurring the tech debt type problems I describe above, but you will likely be unable or unwilling to refactor everything to suit your needs specifically.

I don’t mention this to say that engines are bad. Lots of successful games have used engines made by other people, or re-used engines from previous projects.

However, it’s a different kind of beast using an existing engine.

Instead of making things that suit your needs, and then using them, you’ll be spending your time figuring out how to use the existing puzzle pieces to do what you want. You’ll also be spending time backtracking as you hit dead ends, or where your first cobbled together solution didn’t hold up to the new realities, and you need to find a new path to success that is more robust.

Just something to be aware of when you are looking at licensing or re-using an engine, and thinking that it’ll solve all your problems and be wonderful. Like all things, it comes at a cost!

Using an existing engine does put you ahead of the curve: At day 1 you already have several months of backlogged technical debt!

Unfortunately business realities mean we can’t all just always write brand new engines all the time. It’s unsustainable

Agree / Disagree / Have something to say?

Leave a comment below, or tweet at me on twitter: @Atrix256

Is Code Faster Than Data? Examining Hash Tables

This series of posts is aimed at examining if and how ad hoc code crafted for a specific static (unchanging / constant) data set can run faster than typical generic run time data structures. I think the answer is an obvious “yes, we can often do better”, but these posts explore the details of the problem space and explore how and when we might do better.

The last post explored switch statement performance compared to array access performance.

A switch statement is just a way of telling the compiler how we want to map integral inputs to either some code to run, or some value to return. It’s up to the compiler how to make that happen.

Because compiler makers presumably want to make their compiler generate fast code, it seems like a switch statement should be able to match the speed of an array since a switch statement could very well be implemented as an array when all it is doing is returning different values based on input. Maybe it could even beat an array, in the case of a sparse array, an array with many duplicates, or other situations.

In practice, this doesn’t seem to be the case though, and switch statements are actually quite a bit slower than arrays from the experimentation I’ve done. The main part of the overhead seems to be that it always does a jump (goto) based on the input you are switching on. It can do some intelligent things to find the right location to jump to, but if all you are doing is returning a value, it doesn’t seem smart enough to do a memory read from an array and return that value, instead of doing a jump.

You can read a nice analysis on how switch statements are compiled on the microsoft compiler here: Something You May Not Know About the Switch Statement in C/C++.

Today we are going to be analyzing how hash tables fare against switch statements, arrays, and a few other things.

Testing Details

I ran these tests in x86/x64 debug/release in visual studio 2015.

I got a list of 100 random words from http://www.randomwordgenerator.com/ and made sure they were all lowercase. I associated an integer value with them, from 1 to 100. My tests are all based on the string being the key and the integer being the value.

I have that data stored/represented in several ways for performing lookups:

std::map.
std::unordered_map.
std::unordered_map using crc32 hash function.
std::unordered_map using crc32 hash function modulo 337 and salted with 1147287 to prevent collisions.
SwitchValue() switches on crc32 of input string.
SwitchValueValidate() switches on crc32 of input string but does a single strcmp to handle possibly invalid input.
SwitchValueMinimized() switches on crc32 of input string modulo 337 and salted with 1147287 to prevent collisions.
SwitchValueMinimizedValidate() like SwitchValueMinimized() but does a single strcmp to handle possibly invalid input.
g_SwitchValueMinimizedArray, the array version of SwitchValueMinimized().
g_SwitchValueMinimizedArrayValidate, the array version of SwitchValueMinimizedValidate().
BruteForceByStartingLetter() switches on first letter, then brute force strcmp’s words beginning with that letter.
BruteForce() brute force strcmp’s all words.

The non validating switch statement functions have an __assume(0) in their default case to remove the overhead of testing for invalid values. This is to make them as fast as possible for the cases when you will only be passing valid values in. If ever that contract was broken, you’d hit undefined behavior, so the performance boost comes at a cost. The Validate versions of the switch functions don’t do this, as they are meant to take possibly invalid input in, and handle it gracefully. Both validating and not validating input are common use cases so I wanted to represent both in the performance analysis.

Here are the tests done:

In Order – looks up all strings in order and sums the associated values.
Shuffled – looks up all strings in random order and sums the associated values.
Pre-Hashed Keys In Order – looks up all strings in order and sums the associated values, using pre-hashed keys.
Pre-Hashed Keys Shuffled – looks up all strings in random order and sums the associated values, using pre-hashed keys.

The second two tests only apply to the value lookups which can take pre-hashed keys. For instance, g_SwitchValueMinimizedArray can be indexed by a key that was hashed before the program ran, but a std::unordered_map cannot be indexed by a hash value that was calculated in advance.

Each of those tests were done 5,000 times in a row to make performance differences stand out more, and that full amount of time is the time reported. That process was done 50 times to give both an average (a mean) and a standard deviation to show much much the time samples differed.

The source code for the tests can be found here:
Github: Atrix256/RandomCode/HashVsSwitch

Results

Here are the results, in milliseconds. The values in parentheses are the standard deviations, which are also in milliseconds.

In Order

Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.

	Debug		Release
	Win32	x64	Win32	x64
std::map	7036.77 (126.41)	7070.18 (155.49)	33.02 (2.68)	35.40 (1.43)
std::unordered_map	4235.31 (24.41)	4261.36 (145.16)	19.97 (0.45)	20.12 (0.62)
std::unordered_map crc32	4236.38 (80.72)	4275.36 (116.65)	24.36 (0.47)	23.47 (0.86)
std::unordered_map crc32 minimized	4034.50 (12.72)	4323.67 (170.55)	26.39 (0.50)	23.68 (0.71)
SwitchValue()	123.28 (0.98)	144.29 (4.91)	6.81 (0.30)	5.47 (0.29)
SwitchValueValidate()	127.59 (1.22)	147.41 (5.20)	8.84 (0.35)	7.99 (0.36)
SwitchValueMinimized()	128.83 (0.95)	151.48 (4.66)	8.28 (0.38)	10.18 (0.37)
SwitchValueMinimizedValidate()	132.44 (1.02)	159.85 (6.73)	12.65 (0.40)	10.89 (0.36)
g_SwitchValueMinimizedArray	104.15 (1.13)	122.94 (5.98)	7.68 (0.36)	6.08 (0.36)
g_SwitchValueMinimizedArrayValidate	107.75 (1.07)	120.75 (2.80)	10.49 (0.37)	8.95 (0.32)
BruteForceByStartingLetter()	19.92 (0.63)	22.01 (0.86)	4.85 (0.24)	5.81 (0.26)
BruteForce()	118.65 (1.09)	140.20 (2.28)	31.53 (0.56)	46.47 (0.83)

Shuffled

Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.

	Debug		Release
	Win32	x64	Win32	x64
std::map	7082.92 (214.13)	6999.90 (193.82)	32.14 (0.59)	34.20 (0.62)
std::unordered_map	4155.85 (133.00)	4221.84 (124.70)	20.21 (0.42)	20.09 (0.47)
std::unordered_map crc32	4286.44 (95.39)	4300.81 (64.37)	24.55 (0.57)	23.06 (0.57)
std::unordered_map crc32 minimized	4186.27 (75.35)	4111.73 (43.36)	26.36 (0.56)	23.65 (0.54)
SwitchValue()	127.93 (3.85)	137.63 (1.31)	6.97 (0.32)	5.47 (0.27)
SwitchValueValidate()	131.46 (2.34)	141.38 (1.47)	8.92 (0.38)	7.86 (0.37)
SwitchValueMinimized()	133.03 (2.93)	145.74 (1.50)	9.16 (0.37)	10.50 (0.41)
SwitchValueMinimizedValidate()	135.47 (2.27)	151.58 (1.48)	12.13 (0.40)	10.13 (0.43)
g_SwitchValueMinimizedArray	106.38 (2.70)	118.61 (3.73)	8.18 (0.31)	5.19 (0.29)
g_SwitchValueMinimizedArrayValidate	109.32 (2.34)	120.94 (3.02)	10.49 (0.55)	9.00 (0.40)
BruteForceByStartingLetter()	20.45 (0.92)	21.64 (0.76)	4.90 (0.31)	5.87 (0.32)
BruteForce()	120.70 (2.16)	140.95 (1.71)	32.50 (0.47)	45.90 (0.79)

Pre-hashed In Order

Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.

	Debug		Release
	Win32	x64	Win32	x64
SwitchValue()	12.49 (0.61)	13.23 (0.37)	1.94 (0.17)	1.81 (0.12)
SwitchValueValidate()	17.08 (1.06)	16.72 (0.57)	4.32 (0.30)	4.05 (0.21)
SwitchValueMinimized()	11.83 (0.69)	12.06 (0.51)	1.29 (0.13)	1.58 (0.17)
SwitchValueMinimizedValidate()	16.02 (0.84)	15.84 (0.66)	3.25 (0.24)	3.47 (0.27)
g_SwitchValueMinimizedArray	1.23 (0.06)	1.15 (0.10)	0.00 (0.00)	0.00 (0.00)
g_SwitchValueMinimizedArrayValidate	4.21 (0.32)	2.99 (0.20)	2.45 (0.17)	2.66 (0.20)

Pre-hashed Shuffled

Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.

	Debug		Release
	Win32	x64	Win32	x64
SwitchValue()	12.96 (1.37)	13.45 (0.47)	1.84 (0.11)	1.81 (0.16)
SwitchValueValidate()	16.27 (2.01)	16.57 (0.63)	2.65 (0.19)	2.85 (0.17)
SwitchValueMinimized()	11.75 (0.63)	12.15 (0.45)	1.07 (0.07)	1.06 (0.11)
SwitchValueMinimizedValidate()	16.44 (0.99)	16.33 (0.58)	3.43 (0.18)	3.41 (0.22)
g_SwitchValueMinimizedArray	1.13 (0.06)	1.18 (0.10)	0.32 (0.05)	0.31 (0.04)
g_SwitchValueMinimizedArrayValidate	4.50 (0.32)	3.31 (0.18)	2.82 (0.16)	3.29 (0.18)

Observations

There’s a lot of data, but here’s the things I found most interesting or relevant to what I’m looking at (generic data structures vs ad hoc code for data).

Tests 1 and 2

std::map and std::unordered map are very, very slow in debug as you might expect. It would be interesting to look deeper and see what it is that they are doing in debug to slow them down so much.

There is some tribal knowledge in the C++ world that says to not use std::map and to use std::unordered_map instead, but I was surprised to see just how slow std::map was. in x64 release, std::map took about 75% the time that brute force did, and in win32 release, it took the same time or was slower! std::map isn’t hash based, you give it a comparison function that returns -1,0, or 1 meaning less than, equal or greater than. Even so, you have to wonder how the heck the algorithm can be so slow that brute force is a comparable replacement for lookup times!

It’s interesting to see that everything i tried (except brute force) was significantly faster than both std::map and std::unordered_map. That saddens me a little bit, but to be fair, the usage case I’m going after is a static data structure that has fast lookup speeds, which isn’t what unordered_map aims to solve. This just goes to show that yes, if you have static data that you want fast lookup times for, making ad hoc code or rolling your own read only data structure can give you significant wins to performance, and also can help memory issues (fragmentation and wasted allocations that will never be used).

It was surprising to see that switching on the first letter and brute forcing the strings with the same first letter did so well. That is one of the faster results, competing with SwitchValue() for top dog. The interesting thing though is that BruteForceByStartingLetter() gracefully handles invalid input, while SwitchValue() does not and has undefined behavior, so another point goes to BruteForceByStartingLetter().

Tests 3 and 4

These tests were done with pre-hashed keys to simulate an ideal setup.

If you have a static key to value data structure and have the ability to make ad hoc code for your specific static data, chances are pretty good that you’ll also be able to pre-hash whatever keys you are going to be looking up so you don’t have to hash them at run time. Also, if you are doing multiple lookups with a single key for some reason, you may opt to calculate the hash only on the first lookup, and then from there re-use the hashed key.

These tests simulated those situations.

As expected, the perf results on these tests are much better than those that hash the key on demand for each lookup. Less work done at runtime means better performance.

Based on the results of the last blog post – that array lookups are super fast – you probably aren’t surprised to see that g_SwitchValueMinimizedArray is the winner for performance by far.

It is so fast that the in order case doesn’t even register any time having been taken. This is probably a little misleading, because doing the in order tests (and even the shuffled tests) are very cache friendly. In reality, you probably would have more cache misses and it wouldn’t be quite as cheap as what is being reported, but would still be super fast compared to the other options.

In second place comes SwitchValueMinimized() which is the switch statement function version of g_SwitchValueMinimizedArray. Arrays still beat switch statements, as we found out in the last post!

In third place comes SwitchValue(), which is the same as SwitchValueMinimized() but has sparser values used in the switch statement, which make it more difficult for the compiler to generate efficient code. For instance, having the full range of 32 bits as case statement values, and having them all be pseudo random numbers (because they are the result of a hash!) rules out the ability for the compiler to make a jump table array, or find any patterns in the numbers. The SwitchValueMinimized() function on the other hand has only 337 possible values, and so even though the values are sparse (there are 100 items in those 337 possible values), it’s a small enough number that a jump table array could be used without issues.

After that comes all the validated versions of the tests. It makes sense that they would be slower, because they do all the same work, and then some additional work (strcmp) to ensure that the input is valid.

Getting The Fastest Results

If you have some static data that maps keys to values, and you need it to be fast for doing lookups, it looks like writing something custom is the way to go.

The absolutely fastest way to do it is to make an array out of your data items and then pre-process (or compile time process) any places that do a lookup, to convert keys to array indices. then, at run time, you only need to do an array lookup to a known index to get your data, which is super fast. If your data has duplicates, you might also be able to make the keys which point at duplicate data instead just point at the same array index, to de-duplicate your data.

If doing that is too complex, or too much work, a low tech and low effort way to handle the problem seems to be to break your data up into buckets, possibly based on their first letter, and then doing brute force (or something else) to do the lookup among the fewer number of items.

In fact, that second method is sort of like a hard coded trie which is only two levels deep.

If you needed to do some hashing at runtime, finding a faster hash function (that also worked in constexpr, or at least had a constexpr version!) could help you get closer to the pre-hashed keys timings. The good news is the hash doesn’t have to be particularly good. It just has to be fast and have no collisions for the input values you wish to use. That seems like something where brute force searching simple hash functions with various salt values may give you the ideal solution, but probably would take a very, very long time to find what you are looking for. You might notice that the default hash used for std::unordered_map is actually faster than the crc32 implementation I used.

Of course, we also learned what NOT to do. Don’t use brute force, and don’t use std::map. Using std::unordered_map isn’t super aweful compared to those solutions, but you can do a lot better if you want to.

Why This?

This fast key to value lookup might sound like a contrived need to some people, but in game development (I am a game developer!), there is commonly the concept of a game database, where you look up data about things (how much damage does this unit do?) by looking up a structure based on a unique ID that is a string, named by a human. So, in game dev, which also has high performance needs, optimizing this usage case can be very helpful. There is a little bit more talk about game data needs here: Game Development Needs Data Pipeline Middleware.

Is Code Faster Than Data?

I still think ad hoc code for data structures can often be faster than generic data structures, and the experiments on this post have positive indications of that.

Another way I think ad hoc code could be helpful is when you have hierarchical and/or heterogeneous data structures. By that I mean data structures which have multiple levels, where each level may actually have different needs for how best to look up data in it, and in fact, even siblings on the same level maybe have different needs for how best to look up data in it.

In these cases, you could make some data types which had virtual functions to handle the nature of the data needing different solutions at each level, but those virtual function calls and abstractions add up.

I think it’d be superior to have hard coded code that says “oh, you want index 4 of the root array? ok, that means you are going to binary search this list next”. Of course, that code needs to be generated by another program to be effective. If a human has to make sure all that code stays up to date, it’ll be a ton of work, and it’ll be broken, making very subtle hard to reproduce bugs.

A downside I can see to ad hoc code solutions is possibly thrashing the instruction cache more. Not sure if that’d be an issue in practice, it’d be interesting to try more complex data structures and see how it goes.

Also, it might be difficult to have good heuristics to figure out what is best in which situations. I could see a utility possibly generating different variations of code and running them to see which was most performant. Seems like it’d be a challenge to get 100% right all the time, but our experiments make it seems like it’d be easy to do significantly better than generic algorithms which are meant to be dynamic at runtime.

I also think that more complex data structures are more likely to get benefit of having custom code made for them. Simple ones less likely so. It’s hard to beat an array lookup. That being said, the unbeatable simple data structures make great building blocks for the more complex ones (;

It probably would also be good to look into memory usage a bit more to see how ad hoc code compares to generic algorithms. If ad hoc code is much faster but uses more memory, that’d have to be a conscious decision to make when weighing the trade offs.

Maybe in the future, the C++ standards will allow for static data structure types that you have to initialize with compile time constants (allowing constexpr), that are optimized for lookup times since they can’t have any inserts or deletes? I wonder how much demand there would be for that?

Here’s a good read on some other string friendly data structures:
Data Structures for Strings

Twitter user @ores brought up two interesting points:

It would be interesting to see gperf performs in this situation. If makes a faster minimal perfect hash function, it’ll get us closer to the pre-hashed keys timings.
It would be interesting to time scripting languages to see if for them code is faster than data or not. Another interesting aspect of this would be to look at a JIT compiled scripting language like lua-jit. The thing that makes JIT interesting is that it can compile for your specific CPU, instead of having to compile for a more generic set of CPU features. That gives it the opportunity to make code which will perform better on your specific machine.

Who Cares About Dynamic Array Growth Strategies?

Let’s say that you’ve allocated an array of 20 integers and have used them all. Now it’s time to allocate more, but you aren’t quite sure how many integers you will need in the end. What do you do?

Realloc is probably what you think of first for solving this problem, but let’s ignore that option for the moment. (If you haven’t used realloc before, give this a read! Alloca and Realloc – Useful Tools, Not Ancient Relics)

Without realloc you are left with allocating a new buffer of memory, copying the old buffer to the new buffer, and then freeing the old buffer.

The question remains though, how much memory should you allocate for this new, larger buffer?

You could double your current buffer size whenever you ran out of space. This would mean that as the buffer grew over time, you would do fewer allocations but would have more wasted (allocated but unused) memory.

You could also go the other way and just add 10 more int’s every time you ran out of space. This would mean that you would do a larger number of allocations (more CPU usage, possibly more fragmentation), but you’d end up with less wasted space.

Either way, it obviously depends entirely on usage patterns and it’s all subjective and situational.

…Or is it?

A Surprising Reason For Caring

Believe it or not, growth strategies can make a huge difference. Below we will explore the difference between the seemingly arbitrary rules of making a buffer twice as big, or 1.5 times as big.

Let’s say that we have a bunch of free memory starting at address 0. Let’s analyze what happens as we resize arrays in each scenario.

2x Buffer Size

First let’s see what happens when we double a buffer’s size when it gets full.

We start by allocating 16 bytes. The allocator gives us address 0 for our pointer.

When the buffer gets full, we allocate 32 bytes (at address 16), copy the 16 bytes into it and then free our first 16 byte buffer.

When that buffer gets full, we allocate 64 bytes (at address 48), copy the 32 bytes into it and then free our 32 byte buffer.

Lastly, that buffer gets full, so we allocate 128 bytes (at address 112), copy the 64 bytes into it and then free our 64 byte buffer.

As you can see, doubling the buffer size causes our pointer to keep moving further down in address space, and a free piece of memory before it will never be large enough to hold a future allocation.

1.5x Buffer Size

Let’s see what happens when we make a buffer 1.5x as large when it gets full.

We start by allocating 16 bytes. The allocator gives us address 0 for our pointer.

When the buffer gets full, we allocate 24 bytes (at address 16), copy the 16 bytes into it and then free our first 16 byte buffer.

When that buffer gets full, we allocate 36 bytes (at address 40), copy the 24 bytes into it and free the 24 byte buffer.

When that buffer gets full, we allocate 54 bytes (at address 76), copy the 36 bytes into it and free the 36 byte buffer.

When that buffer gets full, we allocate 81 bytes (at address 130), copy the 54 bytes into it and free the 54 byte buffer.

Lastly, when that buffer gets full, we need to allocate 122 bytes (we rounded it up). In this case, there is 130 bytes of unused memory starting at address 0, so we can just allocate 122 of those bytes, copy our 81 bytes into it and free the 81 byte buffer.

Our allocations have folded back into themselves. Our pattern of resizing hasn’t created an ever moving / ever growing memory fragmentation monster, unlike the buffer size doubling, which has!

Small Print

The above does decrease memory fragmentation, by encouraging an allocation to tend to stay in one spot in memory, but it comes at a cost. That cost is that since it’s allocating less extra memory when it runs out, that you will end up having to do more allocations to reach the same level of memory usage.

That might be a benefit though, depending on your specific needs. Another way of looking at that is that you will end up with fewer bytes of wasted memory. By wasted memory I mean allocated bytes which are not actually used to store anything.

Realloc Makes This Moot Right?

You may be thinking “well if I use realloc, I don’t need to care about this right?”

That isn’t exactly true. If realloc is unable to give you more memory at the current pointer location, it will allocate a new buffer, copy the old data to the new buffer, free the old buffer, and return you the pointer to the new buffer. This is exactly the case that happens when you don’t use realloc.

Using the above growth strategy with realloc makes realloc work even better. It’s a good thing!

Caveat: exotic allocator behavior may not actually benefit from using this strategy with realloc, so have a look for yourself if you are in doubt!

Links

Here’s a discussion on the topic:
What is the ideal growth rate for a dynamically allocated array?

From the link above, apparently the ideal factor to use when upsizing a buffer in general (when worrying about fragmentation like this), is the golden ratio 1.618. Weird, huh?

Have other information about why various growth factors matter? Leave a comment or drop me a line (:

Thanks to Tom for mentioning this concept. Pretty interesting and surprising IMO.

Game Development Needs Data Pipeline Middleware

In 15 years I’ve worked at 7 different game studios, ranging from small to large, working on many different kinds of projects in a variety of roles.

At almost every studio, there was some way for the game to load data at runtime that controlled how it behaved – such as the damage a weapon would do or the cost of an item upgrade.

The studios that didn’t have this setup could definitely have benefited from having it. After all, this is how game designers do their job!

Sometimes though, this data was maintained via excel spreadsheets (export as csv for instance and have the game read that). That is nearly the worst case scenario for data management. Better though is to have an editor which can edit that data, preferably able to edit data described by schemas, which the game also uses to generate code to load that data.

Each studio I’ve worked at that did have game data each had their own solution for their data pipeline, and while they are all of varying qualities, I have yet to see something that is both fast and has most of the features you’d reasonably want or expect.

We really need some middleware to tackle this “solved problem” and offer it to us at a reasonable price so we can stop dealing with it. Open sourced would be fine too. Everyone from engineers to production to content people will be much happier and more productive!

Required Features

Here are the features I believe are required to satisfy most folks:

Be able to define the structure of your data in some format (define data schema).
Have an editor that is able to quickly launch, quickly load up data in the data schema and allow a nice interface to editing the data as well as searching the data.
This edited data should be in some format that merges well (for dealing with branching), and preferably is standardized so you can use common tools on the data – such as XSLT if storing data as xml. XML isn’t commonly very mergable so not sure the solution there other than perhaps a custom merge utility perhaps?
The “data solution” / project file should store your preferences about how you want the final data format to be: xml, json, binary, other? Checkboxes for compression and encryption, etc. Switching the data format should take moments.
There should be a cooking process that can be run from the data editor or via command line which transforms the edited data into whatever format the destination data should be in. AKA turn the human friendly XML into machine friendly binary files which you load in with a single read and then do pointer fixup on.
This pipeline should generate the code that loads and interacts with the data as described in the data schema. For instance you say “load my data” and it does all the decompression, decryption, parsing, etc giving you back a root data structure which contains compile time defined strongly typed structures. This is important because when you change the format of the data that the game uses, no game code actually has to know or care. Whatever it takes to load your data happens when you call the function.

Bonus Points

Here are some bonus point features that would be great to have:

Handle live editing of data. When the game and editor is both open, and data is edited, have it change the data on the game side in real time, and perhaps allow a callback to be intercepted in case the game needs to clear out any cached values or anything. This helps iteration time by letting people make data changes without having to relaunch the game. Also needs to be able to connect to a game over tcp/ip and handle endian correction as needed as well as 32 vs 64 bit processes using the same data.
Handle the usual problems associated with DLC and versioning in an intelligent way. Many data systems that support DLC / Patching / Schema Updates post ship have strange rules about what data you can and can’t change. Often times if you get it wrong, you make a bug that isn’t always obvious. If support for this was built in, and people didnt have to concern themselves with it, it’d be great.
On some development environments, data must be both forwards and backwards compatible. Handling that under the covers in an intelligent way would be awesome.
The editor should be extensible with custom types and plugins for visualizations of data, as well as interactive editing of data. This same code path could be used to integrate parts of the game engine with the editor for instance (slippery slope to making the editor slow, however).
Being able to craft custom curves, and being able to query them simply and efficiently from the game side at runtime would be awesome.
Support “cook time computations”. The data the user works with isn’t always set up the way that would be best for the machine. It’d be great to be able to do custom calculations and computations at runtime. Also great for building acceleration data structures.
You should be able to run queries against the data or custom scripts. To answer questions like “Is anyone using this feature?” and “I need to export data from our game into a format that this other program can read”
Being able to export data as c++ literal data structures, for people who want to embed (at least some of) their data in the exe to reduce complexity, loading times, etc.

It should also be as fast and lightweight as possible. It should allow games to specify memory and file i/o overrides.

Localized text is also a “solved problem” that needs an available solution. It could perhaps be rolled into this, or maybe it would make most sense for it to be separate.

As another example of how having something like this would be useful, on multiple occasions at previous studios, people have suggested we change the format of the data that the game uses at runtime. For instance, from json to a binary format. In each case this has come up so far, people have said it would take too long and it got backlogged (ie killed). With data pipeline middleware that works as i describe it, you would click a few checkboxes, recook your data and test it to have your runtime results. That’s as hard as it SHOULD be, but in practice it’s much harder because everyone rolls their own and the cobbler never has time to fix his shoes (;

Anyone out there want to make this happen? (: