# Rapidly Solving Sudoku, N-Queens, Pentomino Placement, and More, With Knuth’s Algorithm X and Dancing Links.

The C++ code that goes with this post can be found at https://github.com/Atrix256/AlgorithmX

A fun youtube video came out recently showing how people optimized an algorithm from running for an entire month, to eventually running for less than a millisecond: https://www.youtube.com/watch?v=c33AZBnRHks

In that video, algorithm X was mentioned and I was intrigued. Giving it a google, I found this 2018 video by Knuth talking about his “Algorithm X” method to solve exact cover problems, specifically using “Dancing Links” to implement it efficiently: https://www.youtube.com/watch?v=_cR9zDlvP88

So what is an exact cover problem? If you have some tetrominoes (or Tetris pieces) and you want to put them together to form a rectangle, that is an exact cover problem. If you want to put N queens on an NxN chess board such that none can attack each other, that is an exact cover problem. Solving sudoku is also an exact cover problem.

A more formal definition is on Wikipedia at https://en.wikipedia.org/wiki/Exact_cover. Interestingly it says that you can turn any NP problem into an exact cover problem!

I wanted to learn this algorithm to have something in my tool belt to approach these sorts of problems more intelligently, but I also think this will lead to new algorithms for generating noise textures (aka sampling matrices), and/or new types of situationally optimal noise.

Let’s talk about the details of the algorithms a bit and then look at some examples.

## Exact Cover Problems

An exact cover problem is described by a binary matrix of 0s and 1s.

Using Knuth’s terminology, the columns are called items and are the things that are looking to be covered. For example, for placing objects on a chess board, there would be an item (matrix column) per chess board location.

Rows are called options and they contain one or more items that they cover. For example, a tetromino at a specific position on a chess board would cover multiple chess board locations. That option (matrix row) would have a 1 at each of those locations and a 0 at the rest of the locations.

The goal of an exact cover problem is to find a set of options which cover all items exactly once.

An extension to this idea is to have some items be optional. These items don’t have to be covered, but if they are covered, they can only be covered at most once. We’ll make use of this in the N-Queens example.

There are other extensions that I’ll briefly explain at the bottom of this post.

## Algorithm X

Knuth’s Algorithm X is a method for solving exact cover problems. At the highest level, it works like this.

1. Choose an item to cover from the list of items. (only choose from required items)
2. For each option that covers that item, try it as part of the solution:
• Remove each item that option covers. All those items can be considered covered.
• Remove all options that have items that were removed. They are no longer valid choices since they conflict.
• Recurse, having Algorithm X solve the problem with reduced items.
• Undo making this option part of the solution and try the next option.

If the item list becomes empty (or only has optional items left), you have a valid solution.

If any item runs out of valid options, you’ve hit a dead end.

Depending on the problem you are trying to solve, you may find 0 solutions, 1 solution, or multiple solutions. If you are only looking for one solution instead of all possible solutions, you can exit out after finding a solution.

Note that even though you iterate through all the options that cover an item in step 2, you don’t need to iterate through the items in step 1. Recursing will choose the other items, and you will reach all possible solutions.

In step 1 you can choose an item to cover however you want. A very good way to do it, though, is to choose the item with the fewest options left, which is also "the hardest item to cover". That is what Knuth recommends, and from observation, it aggressively culls the search tree without changing the solutions that can be reached, so is great stuff. In the IGN example, I'll talk about how this was the difference between a 24 minute run of the algorithm, and a less than 1 millisecond run of the algorithm for the same exact cover problem!

As far as the order that options are tried in, Knuth recommends either doing them in order, or in a randomized order.

In the code I made for this post, if you only want one solution, it will visit the options in random order, but if you want all solutions, it will do them in order and save the cost of gathering up and shuffling the options.
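
To make the steps above concrete, here is a minimal set-based sketch of the recursion (my own illustration, not the dancing links implementation from the repo): options are lists of item indices, all items are required, and it just counts solutions.

```cpp
#include <algorithm>
#include <climits>
#include <vector>

using Option = std::vector<int>;

// Steps 1 and 2 of Algorithm X, with plain vectors instead of dancing links.
void Solve(int numItems, const std::vector<Option>& options,
           std::vector<bool>& covered, int numCovered,
           std::vector<bool>& optionAlive, int& solutionCount)
{
    if (numCovered == numItems) { solutionCount++; return; } // everything covered

    // Step 1: choose the uncovered item with the fewest live options covering it.
    int bestItem = -1, bestCount = INT_MAX;
    for (int item = 0; item < numItems; ++item) {
        if (covered[item]) continue;
        int count = 0;
        for (size_t o = 0; o < options.size(); ++o)
            if (optionAlive[o] && std::count(options[o].begin(), options[o].end(), item) > 0)
                count++;
        if (count < bestCount) { bestCount = count; bestItem = item; }
    }
    if (bestCount == 0) return; // dead end: an item has no valid options left

    // Step 2: try each live option that covers the chosen item.
    for (size_t o = 0; o < options.size(); ++o) {
        if (!optionAlive[o] || std::count(options[o].begin(), options[o].end(), bestItem) == 0)
            continue;
        // Cover the option's items and remove conflicting options.
        for (int item : options[o]) covered[item] = true;
        std::vector<size_t> removed;
        for (size_t other = 0; other < options.size(); ++other) {
            if (!optionAlive[other]) continue;
            for (int item : options[other])
                if (covered[item]) { optionAlive[other] = false; removed.push_back(other); break; }
        }
        // Recurse, then undo, to try the next option.
        Solve(numItems, options, covered, numCovered + (int)options[o].size(), optionAlive, solutionCount);
        for (size_t r : removed) optionAlive[r] = true;
        for (int item : options[o]) covered[item] = false;
    }
}

int CountExactCovers(int numItems, const std::vector<Option>& options)
{
    std::vector<bool> covered(numItems, false);
    std::vector<bool> optionAlive(options.size(), true);
    int solutionCount = 0;
    Solve(numItems, options, covered, 0, optionAlive, solutionCount);
    return solutionCount;
}
```

This is much slower than dancing links on big problems, but the control flow is the same.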

## Dancing Links

Dancing links is a technique that can be used when implementing Algorithm X to efficiently add and remove items and options.

One observation in Algorithm X is that the matrix is usually pretty sparse with a lot of 0s. Usually, an individual option only covers a few items, and the vast majority of items are left uncovered by that option.

To deal with this, you can make a doubly linked list of items within an option, only putting in the items that are covered.

Having a sparse matrix made up of doubly linked lists makes it easy to remove nodes, by just doing this:

```cpp
node->left->right = node->right;
node->right->left = node->left;
node->left = nullptr;
node->right = nullptr;
```


An observation about doubly linked lists is that you can remove a node from a list but leave the node's pointers intact: you do the first two lines above but not the bottom two. If you do that, you can put the node back into the list by just doing this:

```cpp
node->left->right = node;
node->right->left = node;
```

This allows you to undo any number of node removals, so long as you re-insert them in the reverse order that you removed them. This reverse order handling comes naturally in a recursive algorithm, such as Algorithm X.

Using doubly linked lists for a sparse matrix, and leaving removed nodes “dirty” to be able to quickly undo their removal are the core concepts of using dancing links to implement Algorithm X. This allows items and options to be removed and restored very efficiently, allowing Algorithm X to be very, very fast.
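
As a tiny self-contained illustration of those two operations (using raw pointers rather than the index-based nodes described in the next section):

```cpp
struct Node
{
    Node* left = nullptr;
    Node* right = nullptr;
};

// Unlink a node, but leave its own pointers intact ("dancing").
void Remove(Node* node)
{
    node->left->right = node->right;
    node->right->left = node->left;
}

// Put a previously removed node back, relying on its untouched pointers.
// Removals must be undone in the reverse order they were done.
void Restore(Node* node)
{
    node->left->right = node;
    node->right->left = node;
}
```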

## Implementation Details

Knuth’s code implements the linked lists using integer indices into an array, instead of pointers, with the reasoning that indices can be smaller than a pointer (on x64, a 32 bit int is half the size) and so are more cache friendly. Another nice thing about integer indices is that they don’t become invalid if you resize a dynamic array, which I use (Knuth uses fixed size arrays).

I have a list of items, where an item looks like this:

```cpp
// An item is something to be covered
struct Item
{
    char name[8];
    int  leftItemIndex = -1;
    int  rightItemIndex = -1;
    int  optionCount = -1;
};
```


The name is a label describing the item and is only useful for user friendliness. If you removed it, the algorithm would still work and it would probably be more cache friendly.

There is also an extra item on the end of the list that serves as the “root node” of the linked list. This is the node you start from when iterating through the items.

An “itemIndex” is the index that an item occurs in the m_items array. The left and right item index are the previous and next items in the available items list.

The “optionCount” is a count of how many options are available for an item. This helps the item with the lowest count be found more quickly when deciding which item to cover next.

The other data type is a node:

```cpp
// An option is a sequential list of nodes.
struct Node
{
    // Spacer nodes use upNodeIndex as the index of the previous spacer node,
    // and downNodeIndex as the index of the next spacer node.
    // Non spacer nodes use these to get to the next option for the current item.
    int upNodeIndex = -1;
    int downNodeIndex = -1;

    // Which item a node belongs to.
    // -1 means it is a spacer node.
    // There is a spacer node before and after every option.
    int itemIndex = -1;
};
```


Each item has a node, and in fact, the itemIndex is also the index into the m_nodes array where the node lives for that item.

The upNodeIndex and downNodeIndex go up and down in the sparse matrix, so starting at the node of an item and iterating through that list is how you find the options available for that item.

Options also have nodes. They have one node per item in the option, and the nodes are laid out sequentially in the m_nodes array, so you can add one to a node index to go to the next item in the option, or subtract one to go to the previous item in the option. The itemIndex of a node is the item that the node belongs to.

There is also a “spacer node” at the beginning of each option, which can be identified by having an itemIndex of -1. In a spacer node, the upNodeIndex is the index of the previous spacer node, and the downNodeIndex is the index of the next spacer node. This is useful when iterating through all the items in an option, to be able to get back to the beginning of the option. There is also a spacer node after the last option, which simplifies the code by letting it assume there is always a spacer node at both the beginning and the end of every option.

The other thing worth mentioning is that when covering an item, you remove it from the item list, and then have every option of that item remove itself from all other item lists, except the list of the item being covered. This is necessary so that when the item is uncovered, you can iterate through the options it still knows about, and have each one add itself back into the necessary lists (restoring the up and down pointers).

With the implementation details out of the way, let’s see this algorithm in action!

## Basic Examples

The code for this section is in BasicExamples() in main.cpp.

Knuth has a basic example in his write up (https://www-cs-faculty.stanford.edu/~knuth/programs/dlx1.w):

The items are A,B,C,D,E,F,G where F and G are optional.

The options are:

• CEF
• BCF
• AD
• BG
• DEG

The goal is to find the set of options which covers A,B,C,D,E and may optionally cover F and G, but at most once each.

Running the code that goes with this blog post agrees with the answer he gives…

There is only one solution, which is to use the options: CEF, AD, BG

Wikipedia has another simple example (https://en.wikipedia.org/wiki/Exact_cover#Detailed_example):

The items are 1,2,3,4,5,6,7 and they are all required.

The options are:

• 147 (named A)
• 14 (named B)
• 457 (named C)
• 356 (named D)
• 2367 (named E)
• 27 (named F)

Once again there is only a single solution, which is 14 (B), 356 (D), 27 (F).

Wikipedia then goes on to talk about an “Exact Hitting Set” (https://en.wikipedia.org/wiki/Exact_cover#Exact_hitting_set), which is different from an exact cover problem, but can be solved by transposing the exact cover matrix and then solving that. Check the Wikipedia page for a description of what sort of problems these are.

The items to cover are the names of the options: A, B, C, D, E, F.

Since the matrix is transposed from the exact cover problem, each option (row) is the list of options in the previous problem which contain each number 1 through 7:

• AB (these are the ones that contained a 1 in the last problem)
• EF (ditto for 2)
• DE (3)
• ABC (4)
• CD (5)
• DE (6)
• ACEF (7)

Solving that gives a single answer again: AB, EF, CD.

These are basically diagnostic tests to help make sure the algorithm is working, and help show how it works.

Here is the program output. It found the solutions amazingly fast (< 1 millisecond), but that isn’t too surprising as these are pretty simple problems.

## N-Rooks

Now for something a little more tangible… let’s say we had a standard 8×8 chess board and we are trying to place 8 rooks on it such that no rook could attack any other rook.

The code for this section is in NRooks.h.

In this problem, each row must have a rook in it (and only 1), so we have 8 required items named Y0 through Y7. Each column must also have a rook in it (and only 1), so we have 8 more required items named X0 through X7, making for 16 items total.

You then make an option for each location on chess board, which means 64 options total.

Within each option, you put a 1 for the row and column that is covered by placing a rook there. Putting a 1 in an item for an option means just having that item be in the list for that option.

For instance, placing a rook at location (0,0) on the chess board would put a 1 in X0 and a 1 Y0. Placing a rook at location (4,5) would put a 1 in X4 and a 1 in Y5.
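
Here is a small sketch of that construction (my own illustration; the item numbering is an assumption, not necessarily what NRooks.h does): columns X0..X(N-1) get item indices 0..N-1, rows Y0..Y(N-1) get indices N..2N-1, and each board cell becomes one option.

```cpp
#include <vector>

// One option per board cell (x, y), covering its column item and row item.
std::vector<std::vector<int>> MakeNRooksOptions(int N)
{
    std::vector<std::vector<int>> options;
    for (int y = 0; y < N; ++y)
        for (int x = 0; x < N; ++x)
            options.push_back({ x, N + y }); // item x = column Xx, item N+y = row Yy
    return options;
}
```

Feeding these 64 options and 16 items to an exact cover solver yields the valid 8-rook placements.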

Below is the program output, where it found 40320 solutions in 66 milliseconds, and shows the first 4 solutions it found.

This is for a standard 8×8 chess board with 8 rooks on it, but the problem (and supplied code!) generalizes to an NxN chess board with N rooks on it and is called the N-Rooks problem.

## N-Queens

The N-Queens problem is the same as the N-Rooks problem, but using queens instead of rooks.

The code for this section is in NQueens.h.

We use the same items for queens as we do for rooks, but we also add the diagonals since queens can attack diagonally. In this case, there are more than N diagonals though so we can’t require that all diagonals are used by queens. We deal with this by making the diagonals be optional items.

There are 2N-1 diagonals going left, and 2N-1 diagonals going right, so for when N=8, we add 30 optional items to the 16 required items that rooks have.
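
One common way to index the diagonals (an illustrative sketch; the actual naming in NQueens.h may differ): cell (x, y) lies on "/" diagonal x+y and "\" diagonal x-y+N-1, each index in the range 0 to 2N-2.

```cpp
// Index of the "/" diagonal through cell (x, y): 0 .. 2N-2.
int DiagonalA(int x, int y) { return x + y; }

// Index of the "\" diagonal through cell (x, y): 0 .. 2N-2.
int DiagonalB(int N, int x, int y) { return x - y + N - 1; }
```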

The results are below… 92 solutions found in 7 milliseconds! Wikipedia agrees that there are 92 solutions https://en.wikipedia.org/wiki/Eight_queens_puzzle.

For comparison, when N=12 it finds all answers in 600 ms. When N=13 it finds them in 2.7 seconds. When N=14 it finds them (365,596 solutions) in 14 seconds. Don’t forget that this is a single threaded algorithm running on the CPU!

## Sudoku

The code for this section is in Sudoku.h.

Sudoku is a game played on a 9×9 grid where there are some numbers 1-9 already present on the board and you need to fill in the blank spots such that each row has all numbers 1-9 exactly once, so does each column, and so does each 3×3 block of cells (9 total 3×3 blocks of cells, they don’t overlap).

For solving sudoku, the items to cover are all required and they are:

• 81 for cells: Each of the 81 cells must have a value (it doesn’t matter which value)
• 81 for columns: Each of the 9 columns must have each possible value 1-9.
• 81 for rows: Each of the 9 rows must have each possible value 1-9.
• 81 for blocks: Each of the nine 3×3 blocks must have each possible value 1-9.
• 1 for initial state: Since there is an initial state for the board, one option will describe the initial state and this item ensures that it is always part of any solution found.

That makes for a total of 325 items to cover.

As far as options go, there will be one option for each value in each cell, which means 81*9 = 729 options. We only make options for the cells where the initial state is blank though, which cuts down the actual number of options needed. In our example, we’ll have 30 numbers already covered, leaving 51 uncovered. That means we’ll only need 51*9 = 459 options.

There will also be one option for the initial state, which is the only one that has the initial state item covered, making for a total of 460 options.

As an example of what an option looks like, let’s consider the option of putting a 4 in (3,5). That option will have the following items:

• Cell (3,5) has a value in it
• Column 3 has a 4 in it.
• Row 5 has a 4 in it.
• Block (1,2) has a 4 in it.

Each option has to describe all the things it covers.
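
As a sketch, here is one possible numbering of the 324 cell/column/row/block items (0-based x, y, and value; the extra "initial state" item is left out, and the exact numbering in Sudoku.h may differ):

```cpp
#include <vector>

// Items [0,81):    cell (x,y) has a value.
// Items [81,162):  column x has value v.
// Items [162,243): row y has value v.
// Items [243,324): 3x3 block b has value v.
std::vector<int> SudokuOptionItems(int x, int y, int v) // all in [0, 9)
{
    int block = (y / 3) * 3 + (x / 3);
    return {
        y * 9 + x,
        81 + x * 9 + v,
        162 + y * 9 + v,
        243 + block * 9 + v
    };
}
```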

Wikipedia has an example Sudoku problem that I plugged in (https://en.wikipedia.org/wiki/Sudoku) and then ran the program. It was able to find the same solution listed on wikipedia and verify that it only had one solution, in 2 milliseconds.

## Plus Noise

The code for this section is in PlusNoise.h.

I have a blog post that talks about a “Plus Noise Low Discrepancy Grid” (https://blog.demofox.org/2022/02/01/two-low-discrepancy-grids-plus-shaped-sampling-ldg-and-r2-ldg/). This is a 5×5 matrix of numbers where if you look at any pixel, and the 4 cardinal direction neighbors (toroidally wrapping around at the edges) that it will have each number 1 through 5.

The post shows how to find a formula for that grid, but let’s see what this algorithm can come up with.

Our items to cover will be:

• 25 for cells: Each of the entries in the 5×5 matrix needs to have a value (but it doesn’t matter what value is there).
• 125 for plus shapes: There is a plus shape with a center point of each entry in the matrix, so 25 plus shapes. Each plus shape needs each value 1 through 5, so 25*5 = 125 items.

That makes for 150 items total.

As far as options, we need an option for each possible value at each possible location, which means 25*5 = 150 options.

As an example, the option for putting a 2 in (3,4) would have these items:

• Cell (3,4) has a number
• The plus with a center at (3,4) has a 2 in it.
• The plus with a center at (2,4) has a 2 in it. (The plus to the left)
• The plus with a center at (4,4) has a 2 in it. (The plus to the right)
• The plus with a center at (3,3) has a 2 in it. (The plus above)
• The plus with a center at (3,0) has a 2 in it. (The plus below. Note the y axis wraps here)
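
Sketching that as code (my own illustrative item numbering, with the toroidal wrapping described above): item y*5+x says cell (x,y) has a value, and item 25 + (py*5+px)*5 + v says the plus centered at (px,py) contains value v.

```cpp
#include <vector>

// The six items covered by placing value v at cell (x, y) of the 5x5 grid.
std::vector<int> PlusNoiseOptionItems(int x, int y, int v) // all in [0, 5)
{
    auto plusItem = [v](int px, int py) {
        px = (px + 5) % 5; // wrap toroidally at the edges
        py = (py + 5) % 5;
        return 25 + (py * 5 + px) * 5 + v;
    };
    return {
        y * 5 + x,          // this cell has a value
        plusItem(x, y),     // the plus centered here
        plusItem(x - 1, y), // the plus to the left
        plusItem(x + 1, y), // the plus to the right
        plusItem(x, y - 1), // the plus above
        plusItem(x, y + 1)  // the plus below
    };
}
```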

Running that finds 240 solutions in 14 milliseconds! The original blog post found the formula (x+3y)%5, but after looking at the results, I realized that (x+2y)%5 is also a solution. So that takes care of 2 solutions, what about the other 238?

Well, one thing to realize is that if you have a solution, you can swap any 2 numbers and have another valid solution. This means that you essentially have 5 symbols (0,1,2,3,4) that you can arrange in any way to find a different solution. That means 5! = 120 solutions. Since there are two different formulas for generating this 5×5 matrix, and each has 120 solutions, that accounts for the 240 solutions this found.

Something interesting is that I tried this with smaller (3×3 and 4×4) grid sizes and it wasn’t able to find a solution. Trying larger sized grids, I was only able to find solutions when the size was a multiple of 5, and it still would only find 240 solutions. This matrix tiles, so this all makes a decent amount of sense. The larger sizes that are multiples of 5 just found a solution by tiling the 5×5 solution, interestingly.

## Interleaved Gradient Noise

The code for this section is in IGN.h.

In a previous blog post I wrote about Interleaved Gradient Noise, which is a low discrepancy grid created by Jorge Jimenez while he was at Activision (https://blog.demofox.org/2022/01/01/interleaved-gradient-noise-a-different-kind-of-low-discrepancy-sequence/). IGN is a screen space noise designed to leverage the neighborhood sampling that temporal anti aliasing uses to know when to keep or reject history. Using IGN as the source of per pixel random numbers in stochastic rendering under TAA thus gives better renders for the same costs. It works even better than blue noise in this situation, believe it or not.

This noise is a generalized sudoku where the rows and columns must all have values 1 through 9, but instead of only the 9 main 3×3 blocks needing to have each value 1 through 9, it’s EVERY 3×3 block needing to have that, including overlapping blocks. There is a block with a center at each location of the 9×9 matrix, so there are 81 3×3 blocks.

The items are then…

• 81 for cells: Each of the 81 cells must have a value (it doesn’t matter which value)
• 81 for columns: Each of the 9 columns must have each possible value 1-9.
• 81 for rows: Each of the 9 rows must have each possible value 1-9.
• 729 for blocks: Each cell is the center of a 3×3 block, and each of them must have each value 1-9. So 81*9 = 729 items.

This is 972 items to cover.

Note there is no initial state in this setup, unlike the sudoku section. We want it to find ANY solution it can.

The number of options is still that each cell could have each possible value, so it’s 81*9 = 729 options.

It rapidly found that there were no solutions. It took less than a millisecond and the search space ended up being very small. Having the program spit out the steps it was taking (look for SHOW_ALL_ATTEMPTS), I found that it was able to quickly find a situation that couldn’t be satisfied and that it was right.

Fun fact: I previously had it choosing items to cover from left to right, instead of covering the item with the fewest options remaining. When I did that, the program took 25 minutes to find the answer, as you can see below! The “cover the item with the fewest options” heuristic is pretty powerful, and gives some intuition for why the Wave Function Collapse algorithm (https://github.com/mxgmn/WaveFunctionCollapse) has the same heuristic – collapse whichever item has the fewest options.

So this algorithm confirms that there is no exact solution to the constraints that IGN sets up. IGN is an approximate solution though, and has numerical drift across space to distribute the error of not exactly solving the constraints. IGN itself was tuned by hand and eye, so I would bet it is not the optimal formula for approximately solving these constraints. I’m not sure how you’d find the optimal formula though (optimal, say, in a least squared error sense).

Anyhow, it made me think: as far as TAA is concerned, it’s primarily the 3×3 block constraints that matter to the neighborhood sampling, not the row and column constraints. So, I made an IGNRelaxed() function which doesn’t have them. After running for 13 minutes, it had found 49,536 solutions. I don’t have a very exact way to measure progress, but it didn’t seem anywhere near done, so I killed it.

I’m not sure if this sampling pattern would be any good in practice. On one hand, it is optimized towards TAA’s neighborhood sampling, but it doesn’t have anything to fight regular structure or promote diversity of values in small regions. Maybe it has a use, maybe not. If you give it a try, let me know!

## UPDATE: Extensions Briefly Explained

Here are some other extensions to this algorithm:

• Knuth’s DLX2 gives optional items a color, and allows them to be covered by any number of options of the same color. An example use case is making a crossword puzzle: the letters could be the colors, and it’s ok if two words claim the same letter in the same cell where the words overlap.
• DLX3 allows you to specify a min and max number of coverings per item, instead of requiring exactly 1 covering. This could let you set up constraints like putting N queens and 1 king on a chess board and making it so between 2 and 4 of the queens must be attacking the king.
• DLX5 lets you give a cost per option, and a tax (discount) per item. This lets you find minimum cost solutions. The cost doesn’t drive the search though, unfortunately. It searches exhaustively and gives you the k best scoring solutions.

I’ll bet you can imagine other extensions as well. This basic algorithm seems very natural to modify towards other specific goals, or constraint types.

Algorithm X using Dancing Links is super powerful and super fast. I feel like more folks need to know about these techniques.

My goal was to see if this algorithm (+ extensions) could be used as an alternative to void and cluster and simulated annealing for generating noise textures (sampling matrices). After learning more, it looks like it could, but only in narrow & rare use cases. It has the ability to score solutions, but the scoring doesn’t drive the search. For making blue noise without any other hard constraints, it would just be a brute force search.

I think this algorithm could be made multithreaded if needed, by splitting the “search tree” up and giving a piece to each thread. It’s hard to know how to split this fairly though, so might need some sort of work stealing setup. I’m not real sure, but it seems doable.

Knuth’s writeup on Algorithm X and Dancing Links (DLX1): https://www-cs-faculty.stanford.edu/~knuth/programs/dlx1.w

Here are other programs he has written, including extensions to DLX (DLX2, DLX3, etc): https://www-cs-faculty.stanford.edu/~knuth/programs.html

Wikipedia page on exact cover problems: https://en.wikipedia.org/wiki/Exact_cover

The video that started me down this rabbit hole: https://www.youtube.com/watch?v=c33AZBnRHks

Here is his explanation of it too: https://twitter.com/kmett/status/1585133038135975936?s=20&t=0IWdGtwBR-vN1V26eIN68A

# Calculating Discrete Sums With Umbral Calculus

Before we get into umbral calculus I want to talk about discrete calculus.

Discrete calculus lets you do calculus operations on discrete functions. These are functions like $y=f(x)$ where x has to be an integer. Limits can no longer get arbitrarily close to 0 in discrete calculus, since the smallest difference between integers is 1. One consequence of that is that derivatives (aka deltas, aka $\Delta$) are calculated using finite differences, commonly forward differences. The forward difference falls out naturally from the definition of the derivative by setting h to the smallest possible difference, which is 1.

$\frac{d}{dx} f(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$

$\Delta f = f(x+1) - f(x)$

A nice thing is that there are lots of similarities between discrete calculus and regular calculus, such as the derivative of a constant being zero:

$\frac{d}{dx} c = 0$

$\Delta c = 0$

But there are plenty of differences. The product rule in calculus is different than the one in discrete calculus for example. (Note: this is because the extra term has a multiplier that approaches the limit, which is zero in regular calculus, but 1 in discrete calculus.)

$\frac{d}{dx}(f(x)g(x)) = \frac{d}{dx}(f(x)) g(x) + f(x) \frac{d}{dx}(g(x))$

$\Delta(f(x)g(x)) = \Delta(f(x) )g(x) + f(x) \Delta(g(x)) + \Delta(f(x)) \Delta(g(x))$
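
The discrete product rule is easy to sanity check numerically. A quick sketch (illustrative helper, not from any library):

```cpp
#include <functional>

// Forward difference of a discrete function: delta f(x) = f(x+1) - f(x).
long long Delta(const std::function<long long(long long)>& f, long long x)
{
    return f(x + 1) - f(x);
}
```

With, say, f(x) = x² and g(x) = 3x+1, Δ(fg) at any integer x equals Δf·g + f·Δg + Δf·Δg, but not the two-term continuous product rule.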

Here’s where umbral calculus comes in. It lets you turn a discrete calculus problem into a continuous calculus problem, and it lets you turn a continuous calculus problem into a discrete one.

In this post we are going to leverage umbral calculus to see how to craft functions that sum discrete functions. If you have a discrete quadratic polynomial f, you’ll get a cubic polynomial g that gives you the sum of the values f(n) for n ranging from 0 to x-1.

## How To Do It

Firstly, we need a function that spits out the numbers we want to sum. The function input needs to be integers, but the function output doesn’t need to be.

Once we have our function, the process to get the function $g(x)$ that calculates sums of $f(x)$ is this:

1. Turn the discrete function into a continuous function.
2. Integrate the continuous function.
3. Convert the result from a continuous function back into a discrete function.

If we use $\Phi$ (phi) to denote converting a function from continuous to discrete, and $\Phi^{-1}$ to denote converting a function from discrete to continuous, it looks like this mathematically:

$g(x) = \Phi \int \Phi^{-1} f(x) dx$

When you plug an integer x into g, you get out the sum of $f(n)$ for all values of n from 0 to x-1.

If you want to get the sum of all values of $f(x)$ from a to b, you can calculate that as $g(b+1)-g(a)$. This works because if you wanted the sum of numbers from 10 to 20, you could sum up all the numbers from 0 to 20, and subtract out the numbers from 0 to 10. This is also a discrete calculus definite integral.

That’s the overview, let’s talk about how to calculate $\Phi$ and $\Phi^{-1}$.

## Calculating Φ

Phi is the operator for converting continuous functions to discrete functions. Phi of powers of x is real easy to compute: it turns powers of x into falling factorials of x, which are denoted $x_n$, where $x_n = x(x-1)(x-2)\cdots(x-n+1)$. So, $x^n$ becomes $x_n$.

The oddity of this mysterious conversion is what gave umbral (shadow) calculus its name. It took a while for people to understand why this works and to formalize it. Until then, they just knew it worked (although an imperfect understanding gave them counter cases, so it didn’t always work), but they didn’t know why, so it was a bit mysterious.

The above makes it very easy to work with polynomials. A couple sections down is a cheat sheet with a couple other transformations I scrounged up.

Some other rules of the phi operator include:

$\Phi(a f(x)) = a \Phi f(x)$

$\Phi(f(x) + g(x)) = \Phi f(x) + \Phi g(x)$

$\Phi f(x) \neq f(x) \Phi$

The last one is meant to show that left and right multiplication are not the same thing, so ordering needs to be carefully considered and preserved.

Here are a couple other useful properties that might be helpful, that I’ve scraped from other sources 🙂

$\Delta x_n = n x_{n-1}$

$\Delta \Phi \sin(x) = \Phi \cos(x)$

$\Phi e^{-x} = 0$

$(\Phi \sin(x))^2 + (\Phi \cos(x))^2 = \Phi e^x$

$\Delta 2^x = 2^x$

The last one implies that in discrete calculus, $e$ is 2. That is strange, isn’t it? If you work it out with a few values, you can see that it’s true!
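
Working it out symbolically makes it clear why:

$\Delta 2^x = 2^{x+1} - 2^x = 2 \cdot 2^x - 2^x = 2^x$

which mirrors $\frac{d}{dx} e^x = e^x$ in regular calculus.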

This section is far from exhaustive! Check out the links at the bottom of this post for some more information.

Let’s apply phi to a polynomial as an example of how it works…

$f(x) = 3x^2 + \frac{1}{2}x$

$\Phi f(x) = \Phi (3x^2 + \frac{1}{2}x)$

$\Phi f(x) = \Phi 3x^2 + \Phi \frac{1}{2}x$

$\Phi f(x) = 3 \Phi x^2 + \frac{1}{2} \Phi x$

$\Phi f(x) = 3 (x^2-x) + \frac{1}{2} x$

$\Phi f(x) = 3x^2 - 3x + \frac{1}{2} x$

$\Phi f(x) = 3x^2 - 2.5x$

## Calculating Inverse Φ

Calculating inverse phi is less straightforward – at least as far as I know. There may be a more formulaic way to do it that I don’t know of. I’ll show the inverse phis I know and how they are derived, which should hopefully help if you need others.

**x**

Firstly, the inverse phi of x is just x. That’s an easy one.

**x squared**

Next up is the inverse phi of x squared. We know what phi of x squared is from the last section, so let’s start with that.

$\Phi x^2 = x^2 - x$

Since x is the same as the phi of x, we can replace the last term with phi of x.

$\Phi x^2 = x^2 - \Phi x$

Now let’s get the naked x squared term on the left side, and all the phi terms on the right side.

$x^2 = \Phi x^2 + \Phi x$

Lastly, we’ll left multiply both sides by inverse phi, and we are done!

$\Phi^{-1} x^2 = x^2 + x$

**x cubed**

$\Phi x^3 = x^3 - 3x^2 + 2x$

Let’s turn $2x$ into $3x-x$.

$\Phi x^3 = x^3 - 3x^2 + 3x - x$

Next we factor -3 out of the middle two terms.

$\Phi x^3 = x^3 - 3(x^2 - x) - x$

The middle term is 3 times phi of x squared, so we can replace it with that. The -x on the end outside of the parens is also just negative phi of x so we’ll replace that too.

$\Phi x^3 = x^3 - 3 \Phi x^2 - \Phi x$

Like last time, let’s get the phi terms on the right and the non phi term on the left.

$x^3 = \Phi x^3 + 3 \Phi x^2 + \Phi x$

And again like last time, we can now left multiply by inverse phi to finish!

$\Phi^{-1} x^3 = x^3 + 3 x^2 + x$

## Cheat Sheet

Here is a table of phi and inverse phi transforms, collecting the results derived above:

| $f$ | $\Phi f$ | $\Phi^{-1} f$ |
|---|---|---|
| $x$ | $x$ | $x$ |
| $x^2$ | $x^2 - x$ | $x^2 + x$ |
| $x^3$ | $x^3 - 3x^2 + 2x$ | $x^3 + 3x^2 + x$ |

The coefficients on the polynomials made from powers of x are the Stirling numbers. Phi relates to signed Stirling numbers of the first kind, and inverse phi relates to Stirling numbers of the second kind. That fact helps if you need to work with higher powers of x.

## Example 1: f(x) = x

Let’s use what we know to come up with a function $g(x)$ which sums all integers from 0 to x-1.

Our recipe for making g from f starts out like this:

$g(x) = \Phi \int \Phi^{-1} f(x) dx$

Our function is just $f(x) = x$, so we start at:

$g(x) = \Phi \int (\Phi^{-1} x) dx$

The inverse phi of x is just x.

$g(x) = \Phi \int (x) dx$

Integrating x gives us one half x squared. We can move the half outside of the phi.

$g(x) = \frac{1}{2} \Phi x^2$

We know the phi of x squared so we can plug it in and be done!

$g(x) = \frac{1}{2}(x^2 - x)$

If we plug in 1 to get the sum of all integers from 0 to 0, we get 0.

If we plug in 2 to get the sum of all integers from 0 to 1, we get 1. That is 0 + 1.

Plugging in 3 for summing 0 to 2, we get 3. That is 0 + 1 + 2.

Plugging in 4 gives us 6. That is 0 + 1 + 2 + 3.

Plugging in 5 gives us 10. 10 gives us 45. 100 gives us 4950. 200 gives us 19900. Plugging in a million gives us 499,999,500,000!

If we wanted to sum the integers between 5 and 9, we can do that too! We subtract g(5) from g(10), which is 45-10 or 35. That is 5+6+7+8+9.

So, we know that summing the integers between 100 and 199 is 14,950, since we have g(100) and g(200) above, which are 4950 and 19900 respectively.

## Example 2: f(x) = x^2

Let’s find the function to sum squares of integers.

$g(x) = \Phi \int \Phi^{-1} x^2 dx$

Applying inverse phi…

$g(x) = \Phi \int (x^2 + x) dx$

integrating…

$g(x) = \Phi (\frac{1}{3}x^3 + \frac{1}{2}x^2)$

Splitting it into two phi terms, and moving the fractions out of phi…

$g(x) = \frac{1}{3} \Phi x^3 + \frac{1}{2} \Phi x^2$

Applying phi…

$g(x) = \frac{1}{3} (x^3-3x^2+2x) + \frac{1}{2} (x^2 - x)$

$g(x) = \frac{1}{6}(2x^3-6x^2+4x) + \frac{1}{6}(3x^2-3x)$

$g(x) = \frac{1}{6}(2x^3-3x^2+x)$

Plugging in 2 gives 1, which is 0 + 1.

3 gives 5, which is 0 + 1 + 4.

4 gives 14, which is 0 + 1 + 4 + 9.

5 gives 30, which is 0 + 1 + 4 + 9 + 16.

if we want to get the sum of values for x between 2 and 4, we calculate g(5)-g(2), which is 30-1 = 29. That is 4+9+16.

It works!

## Example 3: f(x) = x^2 - x/2

This last example is to show that while the function f has to take in integers, it doesn’t have to give out integers.

Let’s do the dance…

$g(x) = \Phi \int \Phi^{-1} (x^2 - \frac{1}{2}x) dx$

$g(x) = \Phi \int (\Phi^{-1} x^2 - \frac{1}{2} \Phi^{-1} x) dx$

$g(x) = \Phi \int (x^2 + x - \frac{1}{2} x) dx$

$g(x) = \Phi \int (x^2 + \frac{1}{2} x) dx$

$g(x) = \Phi (\frac{1}{3} x^3 + \frac{1}{4} x^2)$

$g(x) = \frac{1}{3} \Phi x^3 +\frac{1}{4} \Phi x^2$

$g(x) = \frac{1}{3} (x^3-3x^2+2x) +\frac{1}{4}(x^2 - x)$

$g(x) = \frac{1}{12} (4x^3-12x^2+8x + 3x^2 - 3x)$

$g(x) = \frac{1}{12} (4x^3-9x^2+5x)$

Plugging in 1 gives 0.

2 gives 1/2, which is 1 – 1/2.

3 gives 3.5, which is (1-1/2) + (4-1).

4 gives 11, which is (1-1/2) + (4-1) + (9-1.5).

5 gives 25, which is (1-1/2) + (4-1) + (9-1.5) + (16-2).

6 gives 47.5, which is (1-1/2) + (4-1) + (9-1.5) + (16-2) + (25-2.5).

It works!

Finding formulas that sum ranges of discrete functions is pretty cool, but is it very useful?

Well, if you divide a sum by the count of values in the sum, that gives you an average.

That average is also equivalent to doing Monte Carlo integration using regularly spaced samples. I'm not a fan of regularly spaced samples, but for constant time sampling of arbitrary sample counts, I think I might make an exception!

How about converting problems between the continuous and discrete domain, is that very useful for other things? I'm not really sure.

Can this be extended to multi dimensional integrals? I bet so!

This video is what first turned me onto this topic and it is quite good. I’ve watched it about 10 times, trying to scrape all the info out of it. https://www.youtube.com/watch?v=D0EUFP7-P1M

This video is also very good. https://www.youtube.com/watch?v=ytZkmV-7EbM

# Calculating the Similarity of Histograms or PDFs & Interpolating Them Using the p-Wasserstein Distance

The code that goes with this post is at https://github.com/Atrix256/OT1D.

The title of this post is a mouthful but the topic is pretty useful.

Let’s say that we have two histograms:

• 1,2,3,1,0,0,0,0,0,0
• 0,0,0,0,0,0,1,2,3,1

When you want to see how similar two histograms are, some common ways are:

1. Treat the histograms as high dimensional vectors and dot product them. Higher values mean more similar.
2. Treat the histograms as high dimensional points and calculate the distance between them, using the regular distance formula (aka L2 norm or Euclidian norm), taxi cab distance (aka L1 norm) or others. Lower values mean more similar.

Those aren’t bad methods, but they aren’t always the right choice depending on your needs.

1. If you dot product these histograms as 10 dimensional vectors, you get a value of zero because there’s no overlap. If there are more or fewer zeros between them, you still get zero. This gives you no idea how similar they are other than they don’t overlap.
2. If you calculate the distance between these 10 dimensional points, you will get a non zero value, which is nice, but if you add or remove zeros between them, the value doesn't change. You can see this in how the length of the vector (1,0,1) is the square root of 2, which is also the length of the vectors (1,0,0,1) and (1,0,0,0,1). It doesn't really tell you how close they are on the x axis.

In this case, let’s say we want a similarity metric where if we separate these histograms with more zeros, it tells us that they are less similar, and if we remove zeros or make them overlap, it tells us that they are more similar.

This is what the p-Wasserstein distance gives us. When we interpolate along this distance metric from the first histogram to the second, the first histogram slides to the right, and this is actually 1D optimal transport. This is in contrast to us just linearly interpolating the values of these histograms, where the first histogram would shrink as the second histogram grew.

The title of this post says that we will be working with both histograms and PDFs (probability density functions). A PDF is sort of like a continuous version of a histogram, but it integrates to 1 and is limited to non negative values. In fact, if a histogram has only non negative values, and sums up to 1, it can be regarded as a PMF, or probability mass function, which is the discrete version of a PDF and is used for things like dice rolls, coin flips, and other discrete valued random events.

You can do similar operations with PDFs as you can with histograms to calculate similarity.

1. Instead of calculating a dot product between vectors, you take the integral of the product of the two functions. Higher values mean more similar, like with dot product. $\int_0^1 f(x)g(x) \, dx$. This is a kind of continuous dot product.
2. Instead of calculating a distance between points, you take the integral of the absolute value of the difference of the two functions raised to some power, with the result then taken to one over the same power: $(\int_0^1 \left| f(x) - g(x) \right| ^n\, dx)^{1/n}$. Lower values mean more similar, like with distance. This is a kind of continuous distance calculation.

Also, any function that doesn’t have negative y values can be normalized to integrate to 1 over a range of x values and become a PDF. So, when we talk about PDFs in this post, we are really talking about any function that doesn’t have negative values over a specific range.

## p-Wasserstein Distance Formula

“Optimal Transport: A Crash Course” by Kolouri et al. (https://www.imagedatascience.com/transport/OTCrashCourse.pdf) page 46 says that you can calculate p-Wasserstein distance like this (they use different symbols):

$(\int_0^1 \left| F^{-1}(x) - G^{-1}(x) \right| ^p \, dx)^{1/p}$

That looks a bit like the continuous distance integral we saw in the last section, but there are some important differences.

Let me explain the symbols in this function. First is $p$ which is where the p in p-Wasserstein distance comes from. It can be 1, 2, 1.5, or any other number. If using the value 2, you’d call it 2-Wasserstein distance, and it would be dealing with the squares of differences between these functions $F^{-1}(x)$ and $G^{-1}(x)$.

Let’s say that we have a PDF called f. We can integrate that to get a CDF or cumulative distribution function that we’ll call F. We can then switch x and y and solve for y to get the inverse of that function, or $F^{-1}$. If we have another PDF called g, we can follow the same process to get $G^{-1}$.

When we calculate the integral above, either analytically / symbolically if we are able, or numerically otherwise, we get the p-Wasserstein distance which has the nice properties we described before.

A fun tangent: if you have the inverse CDF of a PDF, you can plug in a uniform random number in [0,1] to the inverse CDF and get a random number drawn from the original PDF. It isn’t always possible or easy to get the inverse CDF of a PDF, but this also works numerically – you can make a table out of the PDF, make a table for the CDF, and make a table for the inverse CDF from that, and do the same thing. More info on all of that at https://blog.demofox.org/2017/08/05/generating-random-numbers-from-a-specific-distribution-by-inverting-the-cdf/.

## Calculating 2-Wasserstein Distance of Simple PDFs

Let’s start off by calculating some inverse CDFs of simple PDFs, then we’ll use them in the formula for calculating 2-Wasserstein distances.

Uniform PDF

First is the PDF $y=1$, which is for uniform random values. It integrates to 1 for x between 0 and 1 and is positive in that range as well.

If we integrate that to get the CDF, we get $y=x$.

If we flip x and y, and solve again for y to get the inverse CDF, we get $y=x$.

Linear PDF

The next PDF is $y=2x$. It integrates to 1 for x between 0 and 1 and is positive in that range as well.

If we integrate that, the CDF is $y=x^2$.

Flipping x and y and solving for y gives us the inverse CDF $y=\sqrt{x}$.

Quadratic PDF

The last PDF is $y=3x^2$. It integrates to 1 for x between 0 and 1 and is positive in that range as well.

If we integrate that, the CDF is $y=x^3$

Flipping x and y and solving for y gives us the inverse CDF $y=\sqrt[3]{x}$.

2-Wasserstein Distance From Uniform to Linear

The 2-Wasserstein distance formula looks like this:

$(\int_0^1 \left| F^{-1}(x) - G^{-1}(x) \right| ^2 \, dx)^{1/2}$

Let’s square the function for now and we’ll square root the answer later. Let’s also replace $F^{-1}$ with the uniform PDF inverse CDF, and replace $G^{-1}$ with the linear PDF inverse CDF. We can drop the absolute value since the square does that for us.

$\int_0^1 (x - \sqrt{x})^2 \, dx$

Expanding the equation, we get:

$\int_0^1 x^2 -2x \sqrt{x} + x \, dx$

Luckily we can break that integral up into 3 integrals:

$\int_0^1 x^2 \, dx - \int_0^1 2x \sqrt{x}\, dx + \int_0^1 x \, dx$

https://www.wolframalpha.com/ comes to the rescue here where you can ask it “integrate y=x^2 from 0 to 1” etc to get the answer to each integral.

$1/3 - 4/5 +1/2 = 1/30$

Remembering that we removed the square root from the formula at the beginning, we have to square root this to get our real answer.

$\sqrt{1/30} = \sqrt{30}/30 = 0.18257418583$

2-Wasserstein Distance From Uniform to Quadratic

We can do this again measuring the distance between uniform and quadratic.

$\int_0^1 (x - \sqrt[3]{x})^2 \, dx$

$\int_0^1 x^2-2 x \sqrt[3]{x} + \sqrt[3]{x}^2 \, dx$

$1/3 - 6/7 +3/5 = 8/105$

Square rooting that gives:

$\sqrt{8/105} = (2 * \sqrt{210})/105 = 0.27602622373$

2-Wasserstein Distance From Linear to Quadratic

We can do this again between linear and quadratic.

$\int_0^1 (\sqrt{x} - \sqrt[3]{x})^2 \, dx$

$\int_0^1 x - 2 \sqrt{x} \sqrt[3]{x} + \sqrt[3]{x}^2 \, dx$

$1/2 - 12/11 + 3/5 = 1/110$

Square rooting that gives:

$\sqrt{110}/110 = 0.09534625892$

These results tell us that uniform and quadratic are most different, and that linear and quadratic are most similar. Uniform and linear are in the middle. That matches what you’d expect when looking at the graphs.

## Calculating p-Wasserstein Distance of More Complicated PDFs

If you have a more complex PDF that you can’t make an inverted CDF of, you can still calculate the distance metric numerically.

You start off by making a table of PDF values at evenly spaced intervals, then divide all entries by the sum of all entries to make it a tabular form of the PDF (it sums to 1). You then make a CDF table where each entry is the sum of PDF values at or before that location.

Now to evaluate the inverted CDF for a value x, you can do a binary search of the CDF table to find where that value x is. You also calculate how far, percentage wise, the value sits between the two entries it falls between, so that you have a better approximation of the inverted CDF value. As an alternative to the binary search, you could make an ICDF table instead.

If that sounds confusing, just imagine you have an array where the x axis is the array index and the y axis is the value in the array. This is like a function where if you have some index you can plug in as an x value, it gives you a y axis result. That is what the CDF table is. If you want to evaluate the inverted CDF or ICDF, you have a y value, and you are trying to find what x index it is at. Since the y value may be between two index values, you can give a fractional index answer to make the ICDF calculation more accurate.

Now that you can make an inverted CDF table of any PDF, we can use Monte Carlo integration to calculate our distance metric.

Let’s look at the formula again:

$(\int_0^1 \left| F^{-1}(x) - G^{-1}(x) \right| ^p \, dx)^{1/p}$

To calculate the inner integral, we pick a bunch of x values between 0 and 1, plug them into the equation and average the results that come out. After that, we take the answer to the 1/p power to get our final answer. The more samples you take of x, the more accurate the answer is.

$\text{result}^p = \int_0^1 \left| F^{-1}(x) - G^{-1}(x) \right| ^p \, dx$

Here is some C++ that implements that, with the ICDF functions being template parameters so you can pass any callable thing in (lambda, function pointer, etc) for evaluating the ICDF.

#include <cmath>
#include <random>

// Helpers the snippet relies on; these definitions are my own stand-ins.
float RandomFloat01()
{
    static std::mt19937 rng{ std::random_device{}() };
    return std::uniform_real_distribution<float>(0.0f, 1.0f)(rng);
}

double Lerp(double a, double b, double t) { return a + (b - a) * t; }

template <typename ICDF1, typename ICDF2>
float PWassersteinDistance(float p, const ICDF1& icdf1fn, const ICDF2& icdf2fn, int numSamples = 10000000)
{
    // https://www.imagedatascience.com/transport/OTCrashCourse.pdf page 45
    // Integral from 0 to 1 of abs(ICDF1(x) - ICDF2(x))^p
    // Then take the result to ^(1/p)
    double ret = 0.0;
    for (int i = 0; i < numSamples; ++i)
    {
        float x = RandomFloat01();
        float icdf1 = icdf1fn(x);
        float icdf2 = icdf2fn(x);
        double y = std::pow(std::abs((double)icdf1 - (double)icdf2), p);

        // incremental average: after i+1 samples, ret is the mean of all y so far
        ret = Lerp(ret, y, 1.0 / double(i + 1));
    }
    return (float)std::pow(ret, 1.0f / p);
}


If both ICDFs are tables, that means they are both piecewise linear functions. You can get a noise free, deterministic calculation of the integral by integrating the individual sections of the tables, but the above is a generic way that will work for any (and mixed) methods for evaluating ICDFs.

Below is a table of the 2-Wasserstein distances calculated different ways. The analytic truth column is what we calculated in the last section and is the correct answer. The numeric function column is us running the above numerical integration code, using the inverted CDF we made symbolically in the last section. The last column “numeric table” is where we made a tabular inverted CDF by taking 10,000 PDF samples and making a 100 entry CDF, then using binary search to do inverse CDF evaluations.

For kicks, here are p values of 1 and 3 to compare against 2. Calculated using the “numeric table” method.

## Calculating p-Wasserstein Distance of PMFs

If you have a random variable with discrete random values – like a coin toss, or the roll of a 6 sided dice – you start with a PMF (probability mass function) instead of a PDF. A PMF says how likely each discrete event is, and all the events add up to 1.0. The PMF is probably already a table. You can then turn that into a CDF and continue on like we did in the last section. Let’s see how to make a CDF for a PMF.

When you evaluate y=CDF(x), the y value you get is the probability of getting the value x or lower. When you turn a PMF into a CDF, since it’s discrete, you end up with a piecewise function.

For instance, here is the PMF 0.1, 0.3, 0.2, 0.4.

The CDF is 0.1, 0.4, 0.6, 1.0.

With a CDF you can now do the same steps as described in the last section.

## Interpolating PDFs & Other Functions

Interpolating over p-Wasserstein distance is challenging in the general case. I'm going to dodge that for now, and just show the nice and easy way to interpolate over 1-Wasserstein distance.

First let’s see what happens if we interpolate two gaussian PDFs directly… by linearly interpolating PDF(x) for each x to make a new PDF. As you can see, the gaussian on the left shrinks as the gaussian on the right grows.

If we instead lerp ICDF(x) for each x to make a new ICDF, invert it to make a CDF, then differentiate it to get a PDF, we get this very interesting optimal transport interpolation:

For a second example here is a uniform PDF interpolating into Gaussian, by lerping the PDFs and then the ICDFs.

As a last set of examples, here is uniform to quadratic. First by lerping the PDFs, then by lerping the ICDFs.

It sounds like a lot of work interpolating functions by integrating the PDFs or PMFs to CDFs, inverting them to ICDFs, lerping, then inverting and differentiating, and it really is a lot of work.

However, like I mentioned before, if you plug a uniform random value into an ICDF function, it will give you an unbiased random value from the PDF. This means that in practice, if you are generating non uniform random numbers, you often have an ICDF handy for the distribution already; either symbolically or numerically in a table. That makes this whole process a lot simpler: plug your uniform random x value into the two ICDFs, and return the result of lerping them. That's all there is to it to generate random numbers from a PDF interpolated over 1-Wasserstein distance. That is extremely simple and very runtime friendly for graphics and game dev people. Again, more simply: just generate 2 random numbers and interpolate them.

If you are looking for some intuition as to why interpolating the ICDF does what it does, here are some bullet points.

• PDF(x) or PMF(x) tells you how probable x is.
• CDF(x) tells you how probable it is to get the value x or lower.
• ICDF(x) tells you the value you should get, for a uniform random number x.
• Interpolating PDF1(x) to PDF2(x) makes the values in PDF1 less likely and the values in PDF2 more likely. It’s shrinking PDF1 on the graph, and growing PDF2.
• Interpolating ICDF1(x) to ICDF2(x) gets the values that the two PDFs would give you for the uniform random number x, and interpolates between them. If a random roll would get you a 3 from the first PDF, and a 5 from the second PDF, this interpolates between the 3 and the 5.

In this section we’ve talked about linear interpolation between two PDFs, but this can actually be generalized.

1. A linear interpolation is just a degree 1 (linear) Bezier curve, which has 2 control points that are interpolated between. The formula for a linear Bezier curve is just A(1-t)+Bt which looks exactly like linear interpolation, because it is! With a quadratic Bezier curve, you have 3 control points, which means you could have three probability distributions you were interpolating between, but it would only pass through the two end points – the middle PDF would never be fully used, it would just influence the interpolation. You can do cubic and higher as well for more PDFs to interpolate between. Check this out for more info about Bezier curves. https://blog.demofox.org/2015/05/25/easy-binomial-expansion-bezier-curve-formulas/.
2. A linear interpolation is a Bezier curve on a one dimensional 1-simplex aka a line with barycentric coordinates of (s,t) where s=(1-t). The formula for turning barycentric coordinates into a value is As+Bt which is again just linear interpolation. You can take this to higher dimensions though. In two dimensions, you have a 2-simplex aka a triangle with barycentric coordinates of (u,v,w). The formula for getting a value on a (linear) Bezier triangle is Au+Bv+Cw where A, B and C would be 3 different PDFs in this case. Higher order Bezier triangles exist too. Have a look here for more information: https://blog.demofox.org/2019/12/07/bezier-triangles/
3. Lastly, Bezier surfaces and (hyper) volumes exist, if you want to interpolate in rectangular domains aka “tensor product surfaces”. More info on that here: https://blog.demofox.org/2014/03/22/bezier-curves-part-2-and-bezier-surfaces/.

To see how to linearly interpolate in other p-Wasserstein distance metrics, give this link below a read. I’ll be doing a post on optimal transport in higher dimensions before too long and will revisit it then.

https://pythonot.github.io/auto_examples/plot_barycenter_1D.html

# Rounding Modes For Integer Division

This is a super quick blog post on how to get different rounding modes with integer division: floor to get an integer smaller than or equal to the real value, round to get the integer that is closest to the real value, and ceiling to get an integer greater than or equal to the real value.

These operations are useful when you need to align memory allocations, calculate dispatch sizes for compute shaders, work with objects on a grid, and many other things.

## Unsigned Integers

• Floor: A / B
• Round: (A+B/2) / B
• Ceiling: (A+B-1) / B, or alternately (A-1) / B + 1 (thanks @clift_m!), which avoids overflow when A+B-1 would wrap around, but assumes A > 0
#include <stdio.h>

void RoundTest(unsigned int a, unsigned int b)
{
    printf("%u / %u = %0.2f\n", a, b, float(a) / float(b));
    printf("floor = %u\n", a / b);
    printf("round = %u\n", (a + b / 2) / b);
    printf("ceiling = %u\n\n", (a - 1) / b + 1); // note: this ceiling form assumes a > 0
}


Running the above with a couple different values:

## Signed Integers

Here are routines that handle signed integers, and a testing program for them, courtesy of @insouris.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 30

static int div_floor(int a, int b) { return (a ^ b) < 0 && a ? (1 - abs(a)) / abs(b) - 1 : a / b; }
static int div_round(int a, int b) { return (a ^ b) < 0 ? (a - b / 2) / b : (a + b / 2) / b; }
static int div_ceil(int a, int b) { return (a ^ b) < 0 || !a ? a / b : (abs(a) - 1) / abs(b) + 1; }

int main()
{
    for (int a = -N; a <= N; a++) {
        for (int b = -N; b <= N; b++) {
            if (!b)
                continue;

            const float f = a / (float)b;

            const int ef = (int)floorf(f);
            const int er = (int)roundf(f); // X.5 to X+1, or -X.5 to -X
            const int ec = (int)ceilf(f);

            const int of = div_floor(a, b);
            const int orn = div_round(a, b);
            const int oc = div_ceil(a, b);

            const int df = ef != of;
            const int dr = er != orn;
            const int dc = ec != oc;

            if (df || dr || dc)
                fprintf(stderr, "%d/%d=%g%s\n", a, b, f, (a ^ b) < 0 ? " (diff sign)" : "");

            if (df) fprintf(stderr, "floor: %d != %d\n", of, ef);
            if (dr) fprintf(stderr, "round: %d != %d\n", orn, er);
            if (dc) fprintf(stderr, "ceil: %d != %d\n", oc, ec);
        }
    }
    return 0;
}



# Calculating SVD and PCA in C++

The last post (https://blog.demofox.org/2022/07/10/programming-pca-from-scratch-in-c/) talked about calculating PCA from scratch, and gave some C++ code which did just that.

The post mentioned that the code was somewhat naïve and shouldn’t be used in situations that matter, because the operations done for PCA were notoriously complex numerically and difficult to make robust. It said that a library called Eigen should be used if doing things like that in C++. This post does just that – it uses the Eigen library to compute SVD (Singular Value Decomposition), and also shows how to calculate PCA from SVD, which is a more robust way of doing it than I showed in the last post. Specifically, PCA from SVD doesn’t need to calculate the covariance matrix, which can be a source of numerical problems.

SVD is pretty closely related to PCA, with the main differences being that the formula is different, and data doesn’t need to be centered for SVD, while it does for PCA.

The 190 lines of C++ code that goes along with this post is at https://github.com/Atrix256/SVD. The code uses the Eigen library, which is a header only library.

## Calculating SVD

This video does a great job of explaining how to calculate SVD. Feel free to give it a watch, but I'll also explain below. https://www.youtube.com/watch?v=cOUTpqlX-Xs.

SVD takes as input a matrix ($C$) sized MxN (it doesn’t have to be square) and breaks it apart into a rotation ($U$), a scale ($\Sigma$), and another rotation ($V^T$). Any matrix can be decomposed this way:

$C = U \Sigma V^T$

The scaling matrix $\Sigma$ has zeros everywhere except the diagonal. The values on the diagonal are called the “Singular Values” and they also describe how “important” each column of $U$ and row of $V^T$ is. Just like PCA, you can get rid of the ones with the lowest singular values and thus reduce dimensionality of the data. This is called “Truncated SVD”.

So how do you calculate SVD?

First you multiply the transpose of your (possibly rectangular) matrix $C$ with $C$ itself, to make the square matrix $C^TC$.

The eigenvectors of $C^TC$ get put together as column vectors to make the matrix $V$.

The matrix $\Sigma$ has the square roots of the eigenvalues of $C^TC$ along the diagonal.

The matrix $U$ is made by multiplying $C * V$ and normalizing the columns.

That’s all there is to it!

## C++ Code

Using eigen to calculate SVD as described above is pretty straightforward:

Matrix2d C;
C <<
    5.0f, 5.0f,
    -1.0f, 7.0f;

// Calculate SVD as described in https://www.youtube.com/watch?v=cOUTpqlX-Xs
{
    Matrix2d CTC = C.transpose() * C;

    EigenSolver<Matrix2d> es(CTC);
    EigenSolver<Matrix2d>::EigenvalueType eigenValues = es.eigenvalues();
    EigenSolver<Matrix2d>::EigenvectorsType eigenVectors = es.eigenvectors();

    auto V = eigenVectors;

    MatrixXcd sigma;
    {
        sigma = MatrixXcd::Zero(eigenValues.rows(), eigenValues.rows());
        for (int i = 0; i < eigenValues.rows(); ++i)
            sigma(i, i) = sqrt(eigenValues[i].real());
    }

    MatrixXcd U = C * V;
    for (int i = 0; i < U.cols(); ++i)
        U.col(i).normalize();

    cout << "C = \n" << C << "\n\n";

    cout << "U = \n" << U.real() << "\n\n";
    cout << "sigma = \n" << sigma.real() << "\n\n";
    cout << "V = \n" << V.real() << "\n\n";

    cout << "U * sigma * VT = \n" << (U * sigma * V.transpose()).real() << "\n\n";
}


Running that gives this output. The final output of U*sigma*VT shows that putting the decomposed matrices back together gives you the original data.

While the Eigen library has interfaces to let you calculate eigenvalues and eigenvectors, it also has higher level functionality – including performing SVD itself! There are two ways to do SVD in Eigen… one is better for smaller matrices (JacobiSVD), the other is better for bigger matrices (BDCSVD). In this case, I use the one better for bigger matrices, because AFAIK it’s fine for smaller ones too, just a bit slower.

Matrix2d C;
C <<
    5.0f, 5.0f,
    -1.0f, 7.0f;

// Calculate SVD using the built in functionality
{
    BDCSVD<Matrix2d> svd(C, ComputeFullU | ComputeFullV);

    auto U = svd.matrixU();
    auto V = svd.matrixV();
    auto sigma = svd.singularValues().asDiagonal().toDenseMatrix();

    cout << "C = \n" << C << "\n\n";

    cout << "U = \n" << U << "\n\n";
    cout << "sigma = \n" << sigma << "\n\n";
    cout << "V = \n" << V << "\n\n";

    cout << "U * sigma * VT = \n" << U * sigma * V.transpose() << "\n\n";
}


Running that outputs this, which is the same as before, except the eigenvalues are switched, and one of the eigenvectors was multiplied by -1 (pointing in the opposite direction, but still the same eigenvector).

## Calculating PCA With SVD

To calculate PCA using SVD, all you need to do is run SVD and then multiply C with V. Here’s some code where we use the 3×3 Gaussian kernel with sigma 0.3 from last blog post, and run PCA on it again.

// PCA!
{
    Matrix3d newC;
    newC <<
        0.002300, 0.043200, 0.002300,
        0.043200, 0.818000, 0.043200,
        0.002300, 0.043200, 0.002300;

    BDCSVD<Matrix3d> svd(newC, ComputeFullU | ComputeFullV);

    auto U = svd.matrixU();
    auto V = svd.matrixV();
    auto sigma = svd.singularValues().asDiagonal().toDenseMatrix();

    cout << "C = \n" << newC << "\n\n";

    cout << "V = \n" << V << "\n\n";

    cout << "sigma = \n" << sigma << "\n\n";

    cout << "C * V = \n" << newC * V << "\n\n";

    cout << "Principal Direction = \n" << (newC * V).col(0).normalized() << "\n\n";
}


When we do that we get this. The sigma tells us that there is basically only one non zero component (the others are close enough to zero to be considered zero) and the principal direction matches what we found in the last post as well. Looks like it's working!

## SVD Of Image

Next up, we’ll take an image with size WxH, make a matrix that is Wx(3H) in size, and perform truncated SVD on it, with different numbers of singular values. The matrix is 3 times as tall because we are going to separate each row of RGB pixels into three rows of pixels; one row for each color channel. The code to do this is in the github repo, so go have a look and play around with it if you want to 🙂

First is the old man from legend of zelda 1 in NES. First is the original image, then 1 singular value, then 5%, 10%, 15%, 20%, 25% and 50%.

Next up is a different image, but the same percentages of singular values in the same order.

## Using Truncated SVD In Practice

In the code I set singular values to zero to truncate SVD, but in reality if you wanted to truncate to N dimensions, you would shrink the sigma matrix to be NxN, and you would shrink the U and V matrix both to have N columns. In Eigen, you can do this using the “conservativeResize(rows, cols)” function.

Whether you have truncated or not, the U matrix can be considered your encoded data, and you can multiply sigma by V transpose to make the decoding matrix. Multiplying U by the decoding matrix will give you back your data, or an approximation to it if you truncated.

As a quick overview of what size of compressed data to expect, let’s say you have R rows of D dimensional data (an RxD matrix – R rows of data, and D columns of data).

When you SVD that, U will be RxR, Sigma will be RxR and V will be DxR. Multiplying sigma by V transpose to make the decoding matrix will be RxD. Multiplying U*Sigma*V transpose will be RxD.

Let’s say that you truncate the SVD data from being D dimensional to N dimensional. U will be RxN, Sigma will be NxN, and V will be DxN. Multiplying sigma by V transpose to make the decoding matrix will be NxD. Multiplying U*Sigma*V transpose will still be RxD.

When you have the U and sigma*V transpose matrices and you want to decode a single (scalar) value of data, you get the entire row from U that corresponds to your data’s R dimension, which will be N dimensional, and you get the entire column from the decoding matrix that corresponds to your data’s D dimension, which will also be N dimensional. You then dot product these vectors and have your decoded scalar value. This is useful if you want random access to your encoded data, such as in a shader on the GPU.

Knowing where to truncate to can be done by looking at the sigma values and finding where they taper off enough to your liking. Here is a graph of some SVD’d data where I divided all the singular values by the sum of the singular values to normalize them, and then I graphed the sum of the singular values at or before each dimension. You can see that this data is well fit by 3 or 4 dimensions.

The last 2 columns in this data show “Explained Variance” and “Cumulative Explained Variance”, which is another metric to use to see how good a fit is for a certain number of dimensions. You calculate explained variance by dividing sigma squared by the sum of all sigmas squared. For cumulative explained variance, you sum up the sigmas squared at or before the dimension you are looking at, and divide by the sum of all sigmas squared.

An intuitive perspective to SVD for data compression is that if you have some data that is AxB in size, SVD will break it into three matrices: an AxA matrix, an AxA scaling matrix, and an AxB matrix. You can multiply the middle scaling matrix into either the left or the right matrix and then end up with only two matrices: one is AxA and the other is AxB. At this point, our data is larger than it used to be, but that scaling matrix told us how important each column of the first matrix / row of the second matrix is. We can remove those for any place that the scaling matrix is zero or close enough to zero to be considered zero. That gives us two matrices AxN and NxB. At this point, if (A*N + N*B) is smaller than (A*B) then we’ve compressed our data in a nearly lossless way.

We don’t have to stop here though, and can start trimming more rows and columns off to make the matrices be AxM and MxB, such that (A*M + M*B) is smaller than (A*B) by as much as we want – but we introduce error into the process and our compression is no longer lossless. The error we are adding is minimized in the “least squares” sense, so it will be many small errors instead of a few large errors, which is often desired in practice.

A matrix multiply is all that is needed to get your data back, but individual values can be read back easily as well (random access). You get the a’th row from the first matrix, and the b’th column from the second matrix. The vectors will both be of size M. You dot product them, and your value is “decompressed”.

If doing something like compressing animation, the first matrix might be indexed by time, and the second matrix might be indexed by bone id*3, having 3 entries for the position of a bone starting at that bone id*3 location in the matrix. Then it’s just a vec3 * mat3x3 multiply (or alternatively, 3 dot products) to get the position back.
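Here is what that random access decompression might look like in C++. This is just a sketch with a hypothetical `Decompress` function and storage layout: it assumes the scaling matrix was already multiplied into the right matrix, and that both matrices are stored so each M sized vector is contiguous in memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout: "left" is the A x M matrix stored row major, and "rightT"
// stores each of the B columns of the M x B matrix contiguously (so it is B x M,
// row major). Decompressing value (a, b) is then a single M sized dot product.
double Decompress(const std::vector<double>& left, const std::vector<double>& rightT,
                  size_t M, size_t a, size_t b)
{
    double result = 0.0;
    for (size_t i = 0; i < M; ++i)
        result += left[a * M + i] * rightT[b * M + i];
    return result;
}
```

On a GPU this maps to reading two N sized vectors out of textures and dotting them, which is the cost mentioned later about memory reads.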

## Shorter Intuition & Practical Concerns (Added 10/2/22)

Here is some shorter intuition for using truncated SVD for data compression.

If you start with data that is W * H in size, this lets you make it into data that is (W + H) * N in size, where N is tuneable for quality vs size. The U matrix will be W * N in size, and the SigmaV matrix will be H * N in size.

To decompress a single value at (X,Y), you read an N sized vector from U, an N sized vector from SigmaV, and dot product them to get the value back.

The downside to this technique is the memory (texture) reads, where for each value you want to decompress you need to read 2*N values to dot product them together.

The Eigen library is pretty neat. You can grab the latest zip from https://eigen.tuxfamily.org/ and after unzipping, the “Eigen” folder within the zip has the files you want to include in your program.

Compilation times are pretty long when including files from this library, which I found surprising. It’d probably be best to put the includes to Eigen within a static library that doesn’t get changed often, to avoid the compilation times. I also had to turn on /bigobj in MSVC after hitting an error regarding it.

Other than that, I’m a fan of Eigen and expect to use it when I need to do things like this, instead of writing from scratch, or calling into python scripts to do these numerical operations (hehe).

Here is a great use of SVD to compress PBR textures:
https://bartwronski.com/2020/05/21/dimensionality-reduction-for-image-and-texture-set-compression/

# Programming PCA From Scratch In C++

When running PCA (Principal Component Analysis) on a data set, it gives back a set of orthogonal axes for that data, as well as scalar values telling you the importance of each axis.

When sorting the axes by importance, the first axis is the axis that has the most variation of the data, so it is the best approximation of the data with a straight line. The second axis is perpendicular to the first, and captures the most of the variation that the first axis missed. The third axis is perpendicular to those and captures the most of the remaining variation, and the same goes for the fourth, fifth and so on, up to the dimensionality of your data set.

While you can transform your data to use these axes and reconstruct the data perfectly, PCA is useful because you can decide to throw away some number of the least important axes, thus reducing the dimensionality of the data!

That can be useful for data analysis (reasoning about 2D data is a lot easier than 4D data!), but it can also be a form of data compression, if the error introduced by throwing away one or more dimensions is acceptable.

PCA itself is pretty easy to explain, but to implement it yourself, you need to be able to calculate the eigen values and eigen vectors of a matrix. Those operations are notoriously difficult to do in robust ways, so the common wisdom is to not implement them yourself, but to use software like python or libraries like Eigen (https://eigen.tuxfamily.org/) to do the hard bits for you.

To crack open the black box a bit though, I implemented it in plain C++. It’s about 800 lines including comments, generated all the data in this blog post, and you can check it out here: https://github.com/Atrix256/PCA.

## Calculating PCA

To calculate PCA you…

1. Center the data
2. Calculate the covariance matrix
3. Calculate the covariance eigen values and eigen vectors
4. Transform the data
5. Recover the data

Let’s go through this step by step.

## 1. Center the Data

For this you just calculate the average of your data and subtract the average from all data points.

This makes the data centered on the origin which is important for PCA since it is meant to find the axes of importance in the data – NOT fit a line to the data.

Later on if you want to recover your uncentered data from the PCA encoded data, you’ll need to keep this average around to add it back in. Something to keep in mind if thinking about using PCA for data compression.

## 2. Calculate the Covariance Matrix

If you have N dimensional data, the covariance matrix will be NxN.

Each cell of the covariance matrix i,j will contain the covariance value between all the data in dimension i and dimension j. Since the covariance of a variable with itself is the variance, the diagonal of the matrix will have the variance values of the dimensions.

As a recap, covariance is calculated as:

$cov(X,Y) = \mathbb{E}(XY) - \mathbb{E}(X) * \mathbb{E}(Y)$

Variance is just covariance of a value with itself, so is calculated as:

$var(X) = \mathbb{E}(X^2) - \mathbb{E}(X)^2$

If your data represents a sample of the data instead of the entire data, you’ll need to use sample variance instead, which divides the sums by (n-1) instead of dividing by n.
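The covariance step above can be sketched in C++ like below. The function name is made up for illustration; it centers the data as part of the calculation, and divides by n (population covariance) – divide by n-1 instead if your data is a sample.

```cpp
#include <cstddef>
#include <vector>

// data: P points, each D dimensional, row major. Returns the D x D covariance
// matrix. Centering the data first and averaging products of deviations is
// equivalent to cov(X,Y) = E(XY) - E(X)E(Y).
std::vector<double> CovarianceMatrix(const std::vector<double>& data, size_t P, size_t D)
{
    // per dimension averages (this is also step 1, centering the data)
    std::vector<double> mean(D, 0.0);
    for (size_t p = 0; p < P; ++p)
        for (size_t d = 0; d < D; ++d)
            mean[d] += data[p * D + d] / double(P);

    // cov(i,j) = average of (X_i - mean_i) * (X_j - mean_j)
    std::vector<double> cov(D * D, 0.0);
    for (size_t p = 0; p < P; ++p)
        for (size_t i = 0; i < D; ++i)
            for (size_t j = 0; j < D; ++j)
                cov[i * D + j] += (data[p * D + i] - mean[i]) * (data[p * D + j] - mean[j]) / double(P);
    return cov;
}
```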

## 3. Calculate the Covariance Eigen Values and Eigen Vectors

This is the part that is less easy, where we calculate the eigen values and eigen vectors of the covariance matrix. The eigen vectors are the axes of importance we mentioned before, and the eigen values are how important each one is.

First we need to calculate the eigen values.

I calculate them using the “QR Algorithm” (https://en.wikipedia.org/wiki/QR_algorithm) which is an algorithm that gives you approximate eigen values. It works by repeatedly doing QR decomposition (https://en.wikipedia.org/wiki/QR_decomposition) on the matrix and setting the matrix to R*Q for the next iteration. You do this process until a certain number of iterations has occurred, or all non diagonal elements of the matrix are close enough to zero for your liking. When finished, the approximate eigen values are the diagonal values of the matrix.

I loop until the largest non diagonal value is less than 0.0001, but didn’t play around with this very much. You can see this code at https://github.com/Atrix256/PCA/blob/main/vecmath.h#L321.

As part of the QR decomposition, I use the Gram-Schmidt process (https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process) to make orthogonal vectors.

There are other ways to get eigen values, and other choices you can make in the QR algorithm. I didn’t implement shifting, which helps convergence, and I’ve heard that Householder reflections are much better quality, and not that much more difficult to implement.

In any case, this works well enough for me to get good enough eigen values in my test cases.
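As a sketch of the QR algorithm described above, here is an unshifted version that uses classical Gram-Schmidt for the QR decomposition, and runs a fixed number of iterations rather than checking the off diagonal values. The function names are made up for illustration, and this is not the repo’s code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>; // n x n, row major

// One QR decomposition via classical Gram-Schmidt, returning R*Q.
// Iterating this preserves the eigenvalues of A while driving the
// matrix toward (nearly) diagonal form.
Mat QRStep(const Mat& A, size_t n)
{
    // build orthonormal columns of Q from the columns of A
    Mat Q(n * n, 0.0);
    for (size_t j = 0; j < n; ++j)
    {
        std::vector<double> v(n);
        for (size_t i = 0; i < n; ++i) v[i] = A[i * n + j];
        for (size_t k = 0; k < j; ++k)
        {
            double dot = 0.0;
            for (size_t i = 0; i < n; ++i) dot += Q[i * n + k] * A[i * n + j];
            for (size_t i = 0; i < n; ++i) v[i] -= dot * Q[i * n + k];
        }
        double len = 0.0;
        for (size_t i = 0; i < n; ++i) len += v[i] * v[i];
        len = std::sqrt(len);
        for (size_t i = 0; i < n; ++i) Q[i * n + j] = v[i] / len;
    }
    // R = Q^T * A (upper triangular), then the next iterate is R * Q
    Mat R(n * n, 0.0), next(n * n, 0.0);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                R[i * n + j] += Q[k * n + i] * A[k * n + j];
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                next[i * n + j] += R[i * n + k] * Q[k * n + j];
    return next;
}

// approximate eigenvalues end up on the diagonal
std::vector<double> Eigenvalues(Mat A, size_t n, int maxIters = 100)
{
    for (int iter = 0; iter < maxIters; ++iter)
        A = QRStep(A, n);
    std::vector<double> eigs(n);
    for (size_t i = 0; i < n; ++i) eigs[i] = A[i * n + i];
    return eigs;
}
```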

Now that we have the eigen values, we need to get the eigen vector associated with each one. To do that, I use the inverse iteration algorithm (https://en.wikipedia.org/wiki/Inverse_iteration) which gives you an eigenvector associated with an approximate eigen value. It’s a pretty simple algorithm that just inverts a matrix and then repeatedly puts a randomly initialized vector through it. I do that for 10,000 iterations. The code for it is at https://github.com/Atrix256/PCA/blob/main/vecmath.h#L398.

I also normalize the eigen vectors and re-calculate the eigen values from the newly found eigen vectors to make them match up better. I had to switch from floats to doubles at one point as well, but I may have just been doing something silly, like letting numbers get too small without clamping them to zero.
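Here is what inverse iteration might look like for the 2×2 case, where the matrix inverse has a simple closed form. The names and the specific epsilon shift are just illustrative choices, not the repo’s code.

```cpp
#include <array>
#include <cmath>

// Inverse iteration sketch for a 2x2 matrix: given an approximate eigenvalue mu,
// repeatedly multiplying a vector by (A - mu*I)^-1 makes it converge to the
// associated (normalized) eigenvector.
std::array<double, 2> InverseIteration(const std::array<double, 4>& A, double mu, int iters = 100)
{
    mu += 1e-6; // shift slightly so (A - mu*I) stays invertible

    // B = A - mu*I, inverted with the closed form 2x2 inverse
    double b00 = A[0] - mu, b01 = A[1], b10 = A[2], b11 = A[3] - mu;
    double det = b00 * b11 - b01 * b10;
    double i00 =  b11 / det, i01 = -b01 / det;
    double i10 = -b10 / det, i11 =  b00 / det;

    std::array<double, 2> v = { 1.0, 0.5 }; // arbitrary starting vector
    for (int i = 0; i < iters; ++i)
    {
        std::array<double, 2> next = { i00 * v[0] + i01 * v[1], i10 * v[0] + i11 * v[1] };
        double len = std::sqrt(next[0] * next[0] + next[1] * next[1]);
        v = { next[0] / len, next[1] / len };
    }
    return v;
}
```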

## 4. Transform the data

Sorting the eigen vectors from largest to smallest, you can then make them be columns of a matrix. This is the matrix you use to convert your data into “PCA space”. To reduce dimensionality you can either throw away columns from the right at this point, or you can wait until after transforming the data into the new space. It’s up to you to decide how many columns you can get away with throwing away for your usage case.

## 5. Recover the Data

To recover your data from “PCA space”, you make the eigenvectors for the dimensions that you didn’t throw out as rows in a matrix instead of columns. Multiply the PCA data by this matrix to get the data back minus the information contained in the lost columns.

Not too bad right? I want to re-iterate that while this code works for my handful of test cases on this blog post, it almost certainly is not production ready. It’s just a semi naive implementation for learning purposes.

## Result: Gaussian Blur Kernel

First let’s PCA a 3×3 gaussian blur kernel with a sigma of 0.3.

The filter itself is:

[0.002300, 0.043200, 0.002300]
[0.043200, 0.818000, 0.043200]
[0.002300, 0.043200, 0.002300]

The eigen values and eigen vectors are:

[0.134147] [0.052641, 0.997225, 0.052641]
[0.000000] [0.000000, 0.000000, 0.000000]
[0.000000] [0.000000, 0.000000, 0.000000]

Interestingly, you can see that there’s only one non zero eigenvalue. That means that our data is in fact a rank 1 matrix in disguise! In other words, all the rows are just scalar multiples of each other. We could then replace each row with a single scalar value to multiply the eigen vector by, which would reduce our data from 3D to 1D, at 1/3 the storage size!

The two rows of zero eigen values are due to the fact that gaussian blurs are separable, so rows (or columns) really are just scalar multiples of each other. This test was inspired by a post from Bart Wronski on how to use SVD to separate non separable filters: https://bartwronski.com/2020/02/03/separate-your-filters-svd-and-low-rank-approximation-of-image-filters/

## Result: 2D Data

Here’s the eigenvectors of some 2D data, then showing it reduced to 1 dimension along the larger eigenvector, then showing it including the smaller eigenvector, which makes it fit the data perfectly.

Here is the same with some different data points.

## Result: 3D Data

I took some 3D data from a page that was really helpful when trying to understand PCA better (https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643):

[90.000000, 60.000000, 90.000000]
[90.000000, 90.000000, 30.000000]
[60.000000, 60.000000, 60.000000]
[60.000000, 60.000000, 90.000000]
[30.000000, 30.000000, 30.000000]

The eigen values and eigen vectors are:

[910.069953] [0.655802, 0.429198, 0.621058]
[629.110387] [0.385999, 0.516366, -0.764441]
[44.819660] [0.648790, -0.741050, -0.172964]

And here is the RMSE for the data when PCA’d to different numbers of dimensions:

[1] 25.960163
[2] 6.694749
[3] 0.000000

At 1 component the data is:

[88.540569, 74.751951, 81.346364]
[72.547174, 64.284878, 66.200303]
[63.419540, 58.311187, 57.556254]
[75.638271, 66.307884, 69.127633]
[29.854445, 36.344100, 25.769445]

At 2 components:

[83.264247, 67.693599, 91.795721]
[90.954764, 88.909466, 29.745464]
[62.525570, 57.115286, 59.326695]
[65.892097, 53.270028, 88.429194]
[27.363322, 33.011622, 30.702926]

At 3 components, it matches, since no dimensionality reduction occurred.

[90.000000, 60.000000, 90.000000]
[90.000000, 90.000000, 30.000000]
[60.000000, 60.000000, 60.000000]
[60.000000, 60.000000, 90.000000]
[30.000000, 30.000000, 30.000000]

When you fit PCA to data, it minimizes the distance from the data point to the fit line. This is in contrast to least squares which only minimizes the distance on the y axis. One reason I wanted to dive deep into PCA was to see if it could be used to fit piecewise non linear curves like least squares can. After learning more, that seems to be swimming against the current for PCA, which is more aimed at things like making 3d vectors be storable as 2d vectors. I’ll bet what I was thinking of is possible, but I couldn’t really find info on it. I did find out that “Total Least Squares” exists which does the same kind of fitting though, so I think I should look more into that. Also “non linear dimensional reduction” is a topic as well.

It turns out too that a better way to do PCA than what I’ve done above is to use SVD to do PCA. I plan on doing a post on SVD soon and should be able to talk more about that then.

https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643
https://towardsdatascience.com/principal-component-analysis-explained-d404c34d76e7

Since rolling your own PCA is usually frowned on due to the numerical complexity of things involved, here’s some info about how to do PCA in python (also, in C++ you might want to use that Eigen library I mentioned before):
https://stats.stackexchange.com/questions/235882/pca-in-numpy-and-sklearn-produces-different-results

Check out this link for a neat usage case of PCA for texture compression on textures used in PBR (Physically Based Rendering):
https://bartwronski.com/2020/05/21/dimensionality-reduction-for-image-and-texture-set-compression/

# Piecewise Least Squares Curve Fitting

This post will first talk about how to do equality constraints in least squares curve fitting before showing how to fit multiple piecewise curves to a single set of data. The equality constraints will be used to be able to make the curves c0 continuous, c1 continuous, or higher continuity, as desired.

## Point Constraints

The previous blog post (https://blog.demofox.org/2022/06/06/fitting-data-points-with-weighted-least-squares/) showed how to give weights to points that were fit with least squares. If you want a curve to pass through a specific point, you could add it to the list of points to fit to, but give it a very large weighting compared to the other points, such as a weight of 1000, when the others have a weight of 1.

There is a more direct way to force a curve to go through a point though. I’ll link to some resources giving the derivation details at the end of the post, but for now I’ll just show how to do it.

For ordinary least squares, when making the augmented matrix to solve the system of linear equations it looks like this:

$\begin{bmatrix}A^TA|A^TY\end{bmatrix}$

You can use Gauss Jordan Elimination to solve it for instance (more on that here: https://blog.demofox.org/2017/04/10/solving-n-equations-and-n-unknowns-the-fine-print-gauss-jordan-elimination/)

Doing that solves for c in the equation below, where c is the vector of coefficients of the polynomial that minimizes the sum of squared error for all of the points.

$A^TA*c = A^TY$

For weighted least squares it became this:

$\begin{bmatrix}A^TWA|A^TWY\end{bmatrix}$

Which again solves for c in the equation below:

$A^TWA*c = A^TWY$

If you want to add an equality constraint to this, you make a vector C and a scalar value Z. The vector C is the values you multiply the polynomial coefficients by, and Z is the value you want to get out when you do that multiplication. You can add multiple rows, so C can be a matrix, but it needs to have as many columns as the $A^TA$ matrix does. Z then becomes a vector, which should have as many rows as the C matrix does. The matrix to solve then becomes:

$\left[\begin{array}{rr|r} A^TWA & C^T & A^TWY \\ C & 0 & Z \end{array}\right]$

When solving that and getting the coefficients out, you can ignore the extra values on the right where $Z$ is; those are the values of the Lagrange multipliers (see derivation for more details). It’s just the values where $A^TWY$ is, that become the polynomial coefficients.

Let’s see an example of a point constraint for a linear equation.

A linear equation is $y=ax+b$ where a and b are the coefficients that we are solving for.

In our case, our equation is more generalized into $Z = aC_1 + bC_0$, where our constraint specifies $Z$, $C_0$ and $C_1$.

To make a point constraint, we set $C_0$ to be 1 (aka x value to the 0th power), and we set $C_1$ to be the x value (aka x value to the 1st power). We are making the powers of x for our point just like we do when making the $A$ matrix. Our Z value is the y value we want at that x location.

So, more specifically, if we wanted a linear fit to pass through the point (2, 4), our C vector would be $[1, 2]$ and our Z value would be 4.

If we wanted to give the same point constraint to a quadratic function, the C vector would be $[1, 2, 4]$ and the Z value would still be 4.

For a cubic, the C vector would be $[1, 2, 4, 8]$ and the Z value would still be 4.

If you wanted to constrain a quadratic function to 2 specific points (2,4) and (3, 2), C would become a matrix and would be:

$\begin{bmatrix}1 & 2 & 4 \\ 1 & 3 & 9\end{bmatrix}$

Z would become a vector and would be $\begin{bmatrix} 4 \\ 2\end{bmatrix}$.

Hopefully that makes sense.

There are limitations on the constraints you can give, based on the hard limitations of mathematics. For instance, you can’t tell a linear fit that it has to pass through 3 non-collinear points.
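To make the above concrete, here is a sketch that builds and solves the constrained system for a small example: fitting a line to the points (0,1), (1,2), (3,3) with all weights being 1, while forcing the line through (2,4). The `GaussJordan` helper and other names here are minimal illustrative code, not the code from the repo.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Solve M * x = b via Gauss-Jordan elimination with partial pivoting.
// M is n x n, row major. Returns x.
std::vector<double> GaussJordan(std::vector<double> M, std::vector<double> b, size_t n)
{
    for (size_t col = 0; col < n; ++col)
    {
        // swap in the row with the largest magnitude pivot
        size_t pivot = col;
        for (size_t row = col + 1; row < n; ++row)
            if (std::fabs(M[row * n + col]) > std::fabs(M[pivot * n + col]))
                pivot = row;
        for (size_t i = 0; i < n; ++i) std::swap(M[col * n + i], M[pivot * n + i]);
        std::swap(b[col], b[pivot]);

        // scale the pivot row so the pivot becomes 1
        double scale = M[col * n + col];
        for (size_t i = 0; i < n; ++i) M[col * n + i] /= scale;
        b[col] /= scale;

        // eliminate this column from all other rows
        for (size_t row = 0; row < n; ++row)
        {
            if (row == col) continue;
            double m = M[row * n + col];
            for (size_t i = 0; i < n; ++i) M[row * n + i] -= m * M[col * n + i];
            b[row] -= m * b[col];
        }
    }
    return b;
}

// Constrained line fit of (0,1), (1,2), (3,3), forced through (2,4).
// Augmented system: [[A^TA, C^T], [C, 0]] * [c; lambda] = [A^TY; Z]
std::vector<double> ConstrainedLineFit()
{
    // A^TA for x values 0,1,3 is [[3,4],[4,10]], and A^TY for y values 1,2,3 is [6,11].
    // The constraint row is C = [1, 2] with Z = 4 (pass through the point (2,4)).
    std::vector<double> M = {
        3.0,  4.0, 1.0,
        4.0, 10.0, 2.0,
        1.0,  2.0, 0.0
    };
    std::vector<double> rhs = { 6.0, 11.0, 4.0 };
    return GaussJordan(M, rhs, 3); // [c0, c1, lambda]
}
```

The first two values that come back are the polynomial coefficients, and the last value is the Lagrange multiplier that gets ignored.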

## Point Constraints vs Weights

Let’s see how weights work versus constraints. First in a linear equation where we fit a line to two points, and have a third point that starts off with a weight of 0, then goes to 1, 2, 4, and then 100. Finally we’ll show what the fit looks like with an actual constraint, instead of just a weighted point.

Here is the same with a quadratic curve.

The constraints let you say “This must be true” and then it does a least squares error solve for the points being fit, without violating the specified constraint. As the weight of a point becomes larger, it approaches the same effect that a constraint gives.

## Derivative Constraint

A nice thing about how constraints are specified is that they aren’t limited to point constraints. You are just giving values (Z) that you want when multiplying the unknown polynomial coefficients by known values (C).

If we take the derivative of a quadratic function $y=ax^2+bx+c$, we get $y'=2ax+b$.

We can use this to specify that we want a specific derivative value at a specific x location. We plug the derivative we want in as the Z value, and our C constraint vector becomes $[0, 1, 2x]$. So, if we wanted our quadratic curve to have a slope of 5 at x=1, then our Z value is 5, and our C constraint vector becomes $[0, 1, 2]$. Below is an image which has 3 points to fit with a quadratic curve, but also has this exact slope constraint:

You can check the equation to verify that when x is 1, it has a slope of 5. Plugging in the coefficients and the value of 1 for x, you get 9.384618 + 2 * -2.192309 = 5.

It made sure the slope at x=1 is 5, and then did a quadratic curve fit to the points without violating that constraint, to find the minimal sum of squared errors. Pretty cool!

## Piecewise Curves Without Constraints

Next, let’s see how to solve multiple piecewise curves at a time.

The first thing you do is break the data you want to fit up along the x axis into multiple groups. Each group will be fit by its own curve, as if you did an independent curve fit for each group. We’ll put them all into one matrix though so that later on we can use constraints to make them connect to each other.

Let’s say that we have 6 points (1,3), (2,4), (3,2), (4, 6), (5,7), (6,5), and that we want to fit them with two linear curves. The first curve will fit the data points where x <= 3, and the second curve will fit the data points where x > 3. If we did that as described, there’d be a gap between the 3rd and 4th point though, so we’ll duplicate the (3,2) point, so that it can be the end of the first curve, and the start of the second curve.

For handling multiple curves, we change what is in the A matrix. Normally for a linear fit of 7 points, this matrix would be 2 columns and 7 rows. The first column would be x^0 for each data point, and the second column would be x^1 for each data point. Since we want to fit two linear curves to these 7 data points, our A matrix is instead going to be 4 columns (2 times the number of curves we want to fit) by 7 rows (the number of data points). The first 3 points use the first two columns, and the last four points use the second two columns, like this:

$\begin{bmatrix}x_0^0 & x_0^1 & 0 & 0 \\ x_1^0 & x_1^1 & 0 & 0 \\ x_2^0 & x_2^1 & 0 & 0 \\ 0 & 0 & x_3^0 & x_3^1 \\ 0 & 0 & x_4^0 & x_4^1 \\ 0 & 0 & x_5^0 & x_5^1 \\ 0 & 0 & x_6^0 & x_6^1 \end{bmatrix}$

When we plug in the x values for our points we get:

$A = \begin{bmatrix}1 & 1 & 0 & 0 \\ 1 & 2 & 0 & 0 \\ 1 & 3 & 0 & 0 \\ 0 & 0 & 1 & 3 \\ 0 & 0 & 1 & 4 \\ 0 & 0 & 1 & 5 \\ 0 & 0 & 1 & 6 \end{bmatrix}$

Multiplying the transpose of that A matrix by A to get $A^TA$, we get:

$A^TA = \begin{bmatrix} 3 & 6 & 0 & 0 \\ 6 & 14 & 0 & 0 \\ 0 & 0 & 4 & 18 \\ 0 & 0 & 18 & 86 \end{bmatrix}$

Multiplying the A matrix transposed by the y values of our points, we get:

$A^TY = \begin{bmatrix} 9 \\ 17 \\ 20 \\ 95 \end{bmatrix}$

Making that into an augmented matrix of $[A^TA | A^TY]$ and then solving, we get coefficients of:

$\begin{bmatrix} 4 \\ -1/2 \\ 1/2 \\ 1 \end{bmatrix}$.

The top two coefficients are for the points where x <= 3, and the bottom two coefficients are for the points where x > 3.

$y=-1/2x+4$ (when x <= 3)

$y= x+1/2$ (when x > 3)
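Since there are no constraints yet, the block diagonal $A^TA$ means each curve can actually be solved as its own little 2×2 system. Here is a sketch verifying the numbers above with Cramer’s rule (the function names are made up for illustration):

```cpp
#include <array>
#include <cmath>

// Solve the 2x2 system [[m00,m01],[m10,m11]] * x = [b0,b1] with Cramer's rule
std::array<double, 2> Solve2x2(double m00, double m01, double m10, double m11,
                               double b0, double b1)
{
    double det = m00 * m11 - m01 * m10;
    return { (b0 * m11 - b1 * m01) / det, (m00 * b1 - m10 * b0) / det };
}

// The A^TA and A^TY blocks from the worked example above, solved per curve.
std::array<double, 4> PiecewiseFit()
{
    std::array<double, 2> left  = Solve2x2(3.0,  6.0,  6.0, 14.0,  9.0, 17.0); // x <= 3
    std::array<double, 2> right = Solve2x2(4.0, 18.0, 18.0, 86.0, 20.0, 95.0); // x > 3
    return { left[0], left[1], right[0], right[1] };
}
```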

## Piecewise Curves With Constraints

If we wanted those lines to touch at x=3 instead of being disjoint, we can do that with a constraint!

First remember we are solving for two sets of linear equation coefficients, which we’ll call $c_a$ and $c_b$, each being a vector of the two coefficients needed by our linear equation. If we set the two linear equations equal to each other, we get this:

$c_{a1} * x + c_{a0} = c_{b1} * x + c_{b0}$

now we plug in 3 for x, and we’ll add an explicit multiplication by 1 to the constant terms:

$c_{a1} * 3 + c_{a0} * 1 = c_{b1} * 3 + c_{b0} * 1$

Now we move the right side equation over to the left side:

$c_{a1} * 3 + c_{a0} * 1 + c_{b1} * -3 + c_{b0} * -1 = 0$

We can now turn this into a constraint vector C and a value Z! When making the C vector, keep in mind that it goes constant term, then x1 term for the first curve, then constant term, then x1 term for the second curve. That gives us:

$C = [1, 3, -1, -3]$

$Z = 0$

This constraint says “I don’t care what the value of Y is when X is 3, but they should be equal to each other in both linear curves we are fitting”.

Even though the contents of our A matrix changed, the formulation for including equality constraints remains the same:

$\left[\begin{array}{rr|r} A^TWA & C^T & A^TWY \\ C & 0 & Z \end{array}\right]$

If you plug this in and solve it, you get this:

We can take this farther, and make it so the derivative at x=3 also has to match to make it C1 continuous. For a line with an equation of $y=c_1*x+c_0$, the derivative is just $y' = c_1$ so our constraint is:

$c_{a1} = c_{b1}$

$c_{a1} - c_{b1} = 0$

As a constraint vector, this gives us:

$C = [0, 1, 0, -1]$

$Z = 0$

If we put both constraints in… that the y value must match at x=3 and also that the slope must match at x=3, we get this, which isn’t very interesting:

Since you can define a linear function as a point on the line and a slope, and we have a “must do this” constraint on both a specific point (y matches when x=3) and a slope (the slopes of the lines need to match), it came up with the same line for both parts. We’ve jumped through a lot of extra hoops to fit a single line to our data points. (Note that it is possible to say the slopes need to match and don’t say anything about the lines needing to have a matching y value. You’d get two parallel lines that fit their data points as well as they could, and would not be touching).

This gets more interesting though if we do a higher degree fit. Let’s first fit the data points with two quadratic functions, with no continuity constraints. It looks like they are already C0 continuous, by random chance of the points I’ve picked, but they aren’t. Equation 1 gives y=1.999998 for x=3, while equation 2 gives 2.000089. They are remarkably close, but they are not equal.

Changing it to be C0 continuous doesn’t really change it very much, so let’s skip straight to C1 continuity (which also has the C0 constraint). Remembering that the derivative of a quadratic $y=ax^2+bx+c$, is $y'=2ax+b$, we can set the derivatives equal to each other and plug in x for 3.

$2 * c_{a2} * x + 1 * c_{a1} = 2 * c_{b2} * x + 1 * c_{b1}$

$2 * c_{a2} * x + 1 * c_{a1} - 2 * c_{b2} * x - 1 * c_{b1} = 0$

$6 * c_{a2} + 1 * c_{a1} - 6 * c_{b2} - 1 * c_{b1} = 0$

That makes our constraint be:

$C = [0, 1, 6, 0, -1, -6]$

$Z = 0$

Plugging that in and solving, we get this very tasty fit, where the quadratic curves are C1 continuous at x=3, but otherwise are optimized to minimize the sum of the squared error of the data points:

## An Oddity

I wanted to take OLS to a non matrix algebra form to see if I could get any better intuition for why it worked.

What I did was plug the symbols into the augmented matrix, did Gauss Jordan elimination with the symbols, and then looked at the resulting equations to see if they made any sense from that perspective.

Well, when doing a constant function fit like $y=a$, a ends up being the sum of the y values, divided by the sum of the x values raised to the 0th power.

$a = \frac{\sum{y_i}}{\sum{x_i^0}}$

This makes a lot of sense intuitively, since the sum of x values to the 0th power is just the number of points of data. That makes $a$ be the average y value, or the expected y value. That checks out and makes a lot of sense.

Moving onto a linear fit $y=c_1 x + c_0$, the $c_1$ term is calculated as:

$c_1 = \frac{\sum{(x_i y_i)} \sum{x_i^0} - \sum{x_i} \sum{y_i}}{\sum{(x_i x_i)} \sum{x_i^0} - \sum{x_i} \sum{x_i}}$

What’s strange about this formula, is that it ALMOST looks like the covariance of x and y, divided by the covariance of x and x (aka the variance of x). Which, that kinda does look like a slope / derivative: rise over run. It isn’t written as covariance though, which divides by the number of points to get EXPECTED values, instead of sums. (Though if you divide both the numerator and the denominator by $n^2$, the sums do become expected values, and it is exactly cov(x,y) divided by var(x).)

Then, the $c_0$ term is calculated as this:

$c_0 = \mathbb{E}(y) - \mathbb{E}(x) * c_1$

It again ALMOST makes sense to me, in that we have the expected value of y for the constant term, like we did in the constant function, but then we need to account for the $c_1$ term, so we multiply it by the expected x value and subtract it out.

This almost makes sense but not quite. I don’t think taking it up to quadratic would make it make more sense at this point, so I left it at that.

If you have any insights or intuition, I’d love to hear them! (hit me up at https://twitter.com/Atrix256)

## Closing

The simple standalone C++ code that goes with this post, which made all the data for the graphs, and the python script that graphed them, can be found at https://github.com/Atrix256/PiecewiseLeastSquares.

A nice, more formal description of constrained least squares can be found at http://www.seas.ucla.edu/~vandenbe/133A/lectures/cls.pdf.

Here is another nice one by https://twitter.com/sandwichmaker : https://demofox2.files.wordpress.com/2022/06/constrained_point_fitting.pdf

Here is a nice read about how to determine where to put the breaks of piecewise fits, and how many breaks to make: https://towardsdatascience.com/piecewise-linear-regression-model-what-is-it-and-when-can-we-use-it-93286cfee452

# Fitting Data Points With Weighted Least Squares

About 6 years ago I wrote about how to fit curves, surfaces, volumes, and hypervolumes to data points in an “online algorithm” version of least squares – that is, you could update the fit as new data points come in.

In this post, I’m going to show how to adjust the least squares equations to be able to give weights for how important data points are in the fit. We’ll be looking at curves only, and not focusing on the “online algorithm” aspect, but this will work for both surfaces/volumes as well as the online updating.

The simple standalone C++ code that goes with this post can be found at https://github.com/Atrix256/WeightedLeastSquares

If you google weighted least squares, or look for it on youtube, almost every example you are likely to find talks about it being used when the variance of the data points is different in different parts of the graph, and that is used to make points contribute more or less to the final fit.

That is a pretty neat usage case, but if you are just trying to fit a curve to data points as a way to compress data or optimize calculations, you probably are not really concerned with the variance of the data points you are trying to fit. Luckily, weighting data points in least squares is pretty simple.

Here is the equation you solve in ordinary least squares (OLS):

$A^TAx = A^Ty$

where A is a matrix that contains all the powers of x of each data point, y is a vector that contains the y values of each data point, and x is polynomial coefficients that we are trying to solve for.

A pretty straightforward way to solve this equation for x is to make an augmented matrix where the left side is $A^TA$, and the right side is $A^Ty$.

$\left[\begin{array}{r|r} A^TA & A^Ty \end{array}\right]$

From there you can do Gauss-Jordan elimination to solve for x (https://blog.demofox.org/2017/04/10/solving-n-equations-and-n-unknowns-the-fine-print-gauss-jordan-elimination/). At the end, the left side matrix will be the identity matrix, and the right side vector will be the value for x.

$\left[\begin{array}{r|r} I & x \end{array}\right]$

Adjusting this formula for weighting involves introducing a weight matrix called W. W has a zero in every element, except along the main diagonal where the weights of each data point are. $W_{ii} = \text{weight}_i$.

There is something called “generalized least squares” (GLS) where the weight matrix is a covariance matrix, allowing more complex correlations to be expressed with non zeroes off of the main diagonal, but for weighted least squares, the main diagonal gets the weights of the data points.

The equation to solve for weighted least squares (WLS) becomes this:

$A^TWAx = A^TWy$

You can make an augmented matrix like before:

$\left[\begin{array}{r|r} A^TWA & A^TWy \end{array}\right]$

And use Gauss-Jordan elimination again to solve for x:

$\left[\begin{array}{r|r} I & x \end{array}\right]$
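For a line fit specifically, $A^TWA$ is just a 2×2 matrix, so the whole weighted solve can be sketched in a few lines with Cramer’s rule instead of Gauss-Jordan elimination. The names here are illustrative, not the repo’s code.

```cpp
#include <array>
#include <cmath>
#include <vector>

struct Point { double x, y, weight; };

// Weighted least squares line fit: solves A^TWA c = A^TWy for a line y = c1*x + c0.
// Since W is diagonal, A^TWA and A^TWy are just weighted sums over the points.
std::array<double, 2> FitLineWLS(const std::vector<Point>& points)
{
    double sw = 0, swx = 0, swxx = 0, swy = 0, swxy = 0;
    for (const Point& p : points)
    {
        sw   += p.weight;
        swx  += p.weight * p.x;
        swxx += p.weight * p.x * p.x;
        swy  += p.weight * p.y;
        swxy += p.weight * p.x * p.y;
    }
    // A^TWA = [[sw, swx], [swx, swxx]], A^TWy = [swy, swxy]
    double det = sw * swxx - swx * swx;
    double c0 = (swy * swxx - swxy * swx) / det;
    double c1 = (sw * swxy - swx * swy) / det;
    return { c0, c1 }; // y = c1*x + c0
}
```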

## Examples

Here are a few runs of the program which again is at https://github.com/Atrix256/WeightedLeastSquares

Starting with 3 data points (0,0) (1,10), (2,2) all with the same weighting, we fit them with a line and get $y = x+3$. It’s interesting to see that data points 0 and 2 each have an error of +3 while data point 1 has an error of -6.

If we set the weighting of the 2nd point to 0.0001, we can see that it absorbs most of the error, because the weighting says that it is 10,000 times less important than the other data points.

Alternately, we can give it a weight of 10.0 to make it ten times as important as the other two data points. The error becomes a lot smaller for that data point due to it being more important.

If we switch to a quadratic curve, all three points can be well fit, with the same weights.

The example code works in any degree and also shows how to convert from these power basis polynomials to Bernstein basis polynomials, which are otherwise known as Bezier curve control points (more info on that here: https://blog.demofox.org/2016/12/08/evaluating-polynomials-with-the-gpu-texture-sampler/). The weighting you see in curves is not the same as the weighting here. When you convert the polynomial to a Bezier curve, it’s an unweighted curve. Weighted Bezier curves are made by dividing one Bezier curve by another and are called “Rational” curves, where unweighted curves are called “Integral” curves. Rational (weighted) curves are able to express more shapes than integral curves can, even when those integral curves are made by weighted least squares.

## Closing

While we used weighted least squares to give different weightings to different data points, the main motivation you find on the internet seems to be about having different levels of confidence in the data points you get. This makes me wonder if we could use this in graphics, using 1/PDF(x) as the weighting in the least squares fitting like we do in importance sampled monte carlo integration. This seems especially useful if using this as an “online algorithm” that was improving a fit as it got more samples, either spatially or temporally.

# Sampling Importance Resampling

The last post was on weighted reservoir sampling, one of the main techniques used in the 2020 paper “Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting” (https://research.nvidia.com/sites/default/files/pubs/2020-07_Spatiotemporal-reservoir-resampling/ReSTIR.pdf). This post is on the other main technique used, which is sampling importance resampling or SIR.

SIR lets you take a list of items drawn from probability distribution A, and draw items from that list as if they came from a different probability distribution B.

The C++ code that goes with this post is at https://github.com/Atrix256/ResampledImportanceSampling

## SIR Algorithm

The algorithm starts with a list of values drawn from probability distribution A, assigning each value a weight of 1.

Next, you divide the weight of each item by the PDF for that item in distribution A.

You then multiply the weight of each item by the PDF for that item in distribution B, the target distribution.

Then you normalize the weights of all items in the list so they sum to 1.0.

When you use that normalized weight as a probability for choosing each item, you are drawing values from probability distribution B, even though probability distribution A was used to generate the list.
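The steps above can be sketched like this. This isn’t the repo’s code: `SIRWeights` is a hypothetical helper, and the two PDFs are passed in as functions.

```cpp
#include <functional>
#include <vector>

// Compute normalized SIR weights for a list of samples drawn from
// distribution A, so that weighted selection from the list draws from
// distribution B. pdfA and pdfB give each sample's probability under
// the source and target distributions respectively.
std::vector<double> SIRWeights(const std::vector<double>& samples,
    const std::function<double(double)>& pdfA,
    const std::function<double(double)>& pdfB)
{
    // 1) every item starts with a weight of 1
    std::vector<double> weights(samples.size(), 1.0);

    double weightSum = 0.0;
    for (size_t i = 0; i < samples.size(); ++i)
    {
        weights[i] /= pdfA(samples[i]); // 2) divide by the source PDF
        weights[i] *= pdfB(samples[i]); // 3) multiply by the target PDF
        weightSum += weights[i];
    }

    // 4) normalize so the weights sum to 1.0
    for (double& w : weights)
        w /= weightSum;

    return weights;
}
```

Running this on the {4, 5, 5} example from the next section reproduces the hand-worked weights of 0.75, 0.125, 0.125.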

## Example and Intuition

An example really helps in understanding this, I think. Let’s start with this list of values:

4, 5, 5

Choosing an item uniformly from the list is twice as likely to give you a 5 as a 4, because there are twice as many of them. When we do uniform sampling on that list, it’s like we are doing weighted sampling of the list where all the weights are 1. Let’s put those weights in the list:

4 (1), 5 (1), 5 (1)

The probability for choosing each value in that list is this:

4: 1/3

5: 2/3

You divide the weights by those probabilities to get:

4 (3), 5 (3/2), 5 (3/2)

Or

4 (3), 5 (1.5), 5 (1.5)

Which we can normalize to this:

4 (0.5), 5 (0.25), 5 (0.25)

If you do a weighted random selection from this list using those weights, you can see that you are now just as likely to get a 4 as a 5. So, dividing the weights by the probabilities gave us a uniform distribution.

Now, let’s say our target distribution is that choosing a 4 should be three times as likely as choosing a 5. Here are the probabilities for that:

4: 0.75

5: 0.25

We now multiply the weights by these probabilities and get this:

4 (0.375), 5 (0.0625), 5 (0.0625)

Normalizing those weights, we get this:

4 (0.75), 5 (0.125), 5 (0.125)

If you sample from this list using those weights, there is a 75% chance you’ll get a 4, and a 25% chance you’ll get a 5.

So, even though we made this list by sampling such that a 5 was twice as likely as a 4, we can now sample from it so that a 4 is three times more likely than a 5.

The intuition here is that by dividing the weights by the probabilities that generated the list, we made the list uniform. When we multiplied the weights by the probabilities of the new distribution, it made the list take on that distribution. Not too complicated right?

## Results

Here we generate 100,000 uniform random float values between -1 and 1, and use this technique to sample 10,000 sigma 0.1 Gaussian random numbers from them. This is the starting histogram, and the histogram of the sampling importance resampled values.

Here we start with a sigma 1.0 Gaussian distributed list of values, and resample it into a y=3x^2 PDF.

It doesn’t always work out though. This method can only output values that are in the starting list, so if the starting list doesn’t cover the full range of the target PDF, you won’t get values in the full range of the target PDF. Here we generate uniform random values between -0.1 and 0.3, and then try to resample that into a sigma 0.1 Gaussian. We can only output numbers that exist between -0.1 and 0.3, which isn’t the full range of the Gaussian.

This technique can also hit problems if there aren’t enough unique values in the source list. Here we generate only 500 uniform random samples, instead of the 100,000 that we did for the previous tests, and try to turn it into a Gaussian.

Here we do the non-failure uniform to Gaussian test in reverse, with the same number of start and end samples. It seems as though the rare events at the tail ends of the Gaussian really explode…

## Other Stuff

When doing the SIR algorithm, we normalized the weights to be able to do a random selection from the list. The normalized weight of an item was its probability of being chosen.

This means that you have to do multiple passes through the list: Once to get the sum of the weights, then again to normalize them. Now you can select items from the list randomly.

The last blog post showed us how to do a fair random selection of a weighted list in one pass though, where the weights didn’t need to be normalized. So yep, if you use reservoir sampling to sample from the weighted list, you don’t need to normalize the weights, and you can do the random weighted selection without any preprocessing. You can even calculate the weight for the item the very first time you see it (like as the result of a ray trace?), as you (fairly) consider it for weighted random selection. How cool is that?

Further, the list of samples you draw before resampling might come from a uniform distribution, but it doesn’t need to. Maybe you could do something like MIS, where you select items using one distribution, then from those, find the best item using a second distribution? Not sure if that makes sense or not.

Lastly, while talking about uniform and non-uniform distributions, white noise is implied, but it need not be white noise. Gaussian blue noise is a thing, as is uniform red noise. There should also be ways to use different noise colors on the input and the output, or do weighting (scoring) to promote low discrepancy, or other scoring functions that have some sort of memory or awareness of other samples.

That got off onto a tangent a bit, but the point is, there is a lot of possibility here.

If you understood this post and the last, and want more, give the paper a read. You’ll be able to understand what it’s talking about, and it goes a bit deeper into some related topics and implementation details.

“Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting” https://research.nvidia.com/sites/default/files/pubs/2020-07_Spatiotemporal-reservoir-resampling/ReSTIR.pdf

# Picking Fairly From a List of Unknown Size With Reservoir Sampling

There was an awesome graphics paper in 2020 that put together some fun and interesting statistics ideas to try and solve the “many lights” problem of ray tracing. The many lights problem is that when you are trying to shade a pixel on a surface, you technically need to shoot a ray at every light to see if it is shining onto the surface or not. This is relatively ok for a handful of lights, but becomes a big problem when you have 100,000 lights. The method in the paper has each pixel try a couple samples, share what was found with neighbors, and share with itself next frame to have some memory between frames so that it can improve where it samples over time.

The paper is a good read and you should check it out: “Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting” https://research.nvidia.com/sites/default/files/pubs/2020-07_Spatiotemporal-reservoir-resampling/ReSTIR.pdf

In this post, we are going to be talking about one of the methods the paper uses, which is reservoir sampling. This allows you to stream a list of data and fairly (uniform randomly) choose an item from the list as you go. This also works when you want to choose items in the list with different probabilities. You can also just choose a random subset of the list to look at, do the same process, and the results aren’t bad.

This is useful in the many lights problem because it allows a random subset of lights to be considered by each pixel, each frame, and over time they find the best lights to sample with limited ray counts.

There are lots of other uses for reservoir sampling though, within rendering and game development, and outside of it too.

The 237 line single file C++ program that made the data for this post can be found at: https://github.com/Atrix256/WeightedReservoirSampling

## Uniform Reservoir Sampling

Choosing an item uniform randomly from a streaming list of unknown size is surprisingly easy. Starting at index 1, the chance of selecting an item is 1/index. This means that index 1 is chosen when seen, then there is a 50% chance (1/2) to select index 2 instead. After that, there is a 33% chance (1/3) to select index 3 instead, and a 25% chance (1/4) to select index 4, and so on.

The mathematics of why this works is illustrated well in this youtube video: https://www.youtube.com/watch?v=A1iwzSew5QY

int index = 0;
int chosenValue = 0;

while (MoreValuesAvailable())
{
    index++;
    int nextValue = GetNextValue();
    float chance = 1.0f / float(index);
    if (GetRandomFloat01() < chance)
        chosenValue = nextValue;
}

## Weighted Reservoir Sampling

Adding non uniform sampling to the algorithm is not much more difficult. The chance to choose a specific item becomes the weight of that item, divided by the sum of the weights of the items seen so far. The weights don’t have to be a normalized PDF or PMF, you can use weights of any scale. If you use a weight of “1” for all the items, it becomes the same as uniform reservoir sampling.

float weightSum = 0.0f;
int chosenValue = 0;

while (MoreValuesAvailable())
{
    item nextValue = GetNextValue();
    weightSum += nextValue.weight;
    float chance = nextValue.weight / weightSum;
    if (GetRandomFloat01() < chance)
        chosenValue = nextValue.value;
}

## Subset Uniform Reservoir Sampling

If you are trying to uniform randomly choose an item from a list that you know the size of, but is just too big, you can do uniform reservoir sampling on a random subset of the list. The larger the random subset, the better the results, but smaller subsets can work too – especially if you are amortizing a search of a larger subset over time.

int chosenValue = 0;

for (int index = 1; index <= sampleCount; ++index)
{
    int nextValue = GetRandomListValue();
    float chance = 1.0f / float(index);
    if (GetRandomFloat01() < chance)
        chosenValue = nextValue;
}

## Subset Weighted Reservoir Sampling

Adding weighting to the subset reservoir sampling is pretty easy too.

float weightSum = 0.0f;
int chosenValue = 0;

for (int index = 1; index <= sampleCount; ++index)
{
    item nextValue = GetRandomListValue();
    weightSum += nextValue.weight;
    float chance = nextValue.weight / weightSum;
    if (GetRandomFloat01() < chance)
        chosenValue = nextValue.value;
}

## Combining Reservoirs

If you have multiple reservoirs, they can be combined. The paper mentioned uses this to combine last frame reservoir samples with current frame reservoir samples.

void CombineReservoirs(int chosenValueA, float weightSumA, int chosenValueB, float weightSumB, int& newChosenValue, float& newWeightSum)
{
    newWeightSum = weightSumA + weightSumB;
    float chanceA = weightSumA / newWeightSum;
    if (GetRandomFloat01() < chanceA)
        newChosenValue = chosenValueA;
    else
        newChosenValue = chosenValueB;
}

For uniform sampling in the above, weightSum is the number of items seen. It’s as if every item had a weight of 1.
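As a sanity check of that claim, here is a self-contained sketch that combines two uniform reservoirs many times and counts how often reservoir A’s winner survives. `GetRandomFloat01()` from the snippet above is replaced with a seeded `std::mt19937` so the result is reproducible, and `CountAWins` is a helper name I made up for the test.

```cpp
#include <random>

// Combine two reservoirs: the winner of reservoir A survives with
// probability weightSumA / (weightSumA + weightSumB). The rng is passed
// in so results are reproducible.
void CombineReservoirs(int chosenValueA, float weightSumA, int chosenValueB, float weightSumB,
    int& newChosenValue, float& newWeightSum, std::mt19937& rng)
{
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    newWeightSum = weightSumA + weightSumB;
    float chanceA = weightSumA / newWeightSum;
    if (dist(rng) < chanceA)
        newChosenValue = chosenValueA;
    else
        newChosenValue = chosenValueB;
}

// Count how often reservoir A's value (1) survives a combine, over many trials.
int CountAWins(float weightSumA, float weightSumB, int trials)
{
    std::mt19937 rng(1337); // fixed seed for reproducibility
    int aWins = 0;
    for (int i = 0; i < trials; ++i)
    {
        int chosen = 0;
        float weightSum = 0.0f;
        CombineReservoirs(1, weightSumA, 2, weightSumB, chosen, weightSum, rng);
        if (chosen == 1)
            aWins++;
    }
    return aWins;
}
```

Combining a uniform reservoir that saw 3 items with one that saw 2 items should keep A’s winner about 60% of the time, which is what the counts show.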

## Choosing Multiple Items

If you want to choose N items instead of just 1, you could run the single item selection code N times in parallel to choose those N items (note: the same item could be chosen multiple times). If you do that, you’ll notice that the weight sum / count will be the same for each of them, so you don’t actually need to store it per selection; it can be shared. That gives you something like this for uniform reservoir sampling.

int index = 0;
int chosenValues[N];

while (MoreValuesAvailable())
{
    index++;
    int nextValue = GetNextValue();
    float chance = 1.0f / float(index);
    for (int i = 0; i < N; ++i)
    {
        if (GetRandomFloat01() < chance)
            chosenValues[i] = nextValue;
    }
}