What I wish I knew about ESPP and RSUs sooner (company stock benefits; 2024, USA, California)

Disclaimer: This is based on information I believe to be true in 2024, in the United States, and in the state of California specifically. Laws change over time and there are subtleties to all these things, so please talk to a financial professional before taking specific actions. I do hope this helps illuminate how to get the most out of these benefits though.

If you are a programmer (and in many other fields), you are likely to encounter an Employee Stock Purchase Plan (ESPP) at your work. If you work in an in-demand area, you are also likely to encounter Restricted Stock Units (RSUs). This article aims to illuminate some things I wish I knew or realized earlier in my career.

Let’s talk about how these work.

How Do ESPPs Work?

ESPPs let you deduct money out of your paycheck automatically, into an account, where periodically (like once a quarter) it is used to buy company stock. There is commonly a “look back” window of time where the price you pay per share is the lowest price of the stock in that window. Also commonly, the stock is offered at a discount on top of that, say 5%, 10% or 15% off.

The money taken out of your paycheck is post-tax, meaning you pay income tax on the money you use to buy the stock.

When you sell the stock, the discount you got when buying the shares becomes income, which follows regular income tax rules for both federal and state taxes. If the stock has gone up, capital gains tax applies to the increased value.

Example:

Let’s say you had 1700$ in your account when it was time for the stock buy to happen. Let’s say the current price of the stock was 130$, but that there was a 6 month “look back” window, and the lowest price of the stock in the last 6 months was 100$. You get to purchase stock at 100$, instead of at the 130$ it currently costs. Buying the stock already gives you a profit of 30$ a share! On top of this, let’s say you get a 15% discount. That 15% discount applies to the 100$ price, meaning you get to buy shares at 85$ instead of the current market price of 130$. That allows you to buy 20 shares of stock for that 1700$, instead of the 13 you could buy at the current market price.

Let’s say you hold onto the stock for a while and it goes up to 250$ a share, and then you decide to sell.

Since you bought 20 shares of stock at 85$, when the stock was actually 130$ a share, that discount of 45$ per share becomes taxable income. 45$ * 20 shares = 900$ of increased taxable income on your tax return, which could even push you into a different tax bracket.

However, the stock also went up 120$ a share. 120$ * 20 shares = 2400$ of “capital gains”. In the United States, for federal taxes, if you’ve held this stock for less than 1 year, you pay “short term capital gains” tax, which means the capital gains are treated as regular income, which can push you into a different tax bracket and all of that. If you’ve held onto it for longer than a year, you pay “long term capital gains” tax, which is based on your income, but most folks will be in the 15% bracket.
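Here’s the example’s arithmetic as a tiny C++ sketch (the numbers are illustrative, and it ignores what tax rates actually apply):

#include <cstdio>

int main()
{
	// The ESPP example from above. All values are illustrative.
	float contribution = 1700.0f;  // money in the ESPP account at purchase time
	float marketPrice = 130.0f;    // current price per share
	float lookbackPrice = 100.0f;  // lowest price in the 6 month look back window
	float discount = 0.15f;        // the 15% ESPP discount

	float purchasePrice = lookbackPrice * (1.0f - discount);              // 85$
	int shares = int(contribution / purchasePrice);                       // 20 shares
	float discountIncome = float(shares) * (marketPrice - purchasePrice); // 45$ * 20 = 900$ of taxable income

	float salePrice = 250.0f;
	float capitalGains = float(shares) * (salePrice - marketPrice);       // 120$ * 20 = 2400$ of capital gains

	printf("shares = %d, discount income = %0.2f$, capital gains = %0.2f$\n",
		shares, discountIncome, capitalGains);
	return 0;
}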

In the state of California, there is no distinction between long and short term capital gains. The gains all become taxable income.

There are also limits to how much you can contribute to an ESPP. In the USA it’s currently 25,000$ per calendar year. https://communications.fidelity.com/pdf/refund-of-espp-contributions.pdf


How Do RSUs work?

RSU stands for “Restricted Stock Unit” and they are stock that is promised to you, vesting (becoming yours) over a specified schedule. For instance, you may be given 480 RSUs when you are hired, with 120 vesting (being given to you) at your 1 year anniversary, and 30 more every quarter for the 3 years after that, until all 480 shares are vested.
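As a tiny sketch of that example schedule:

#include <cstdio>

int main()
{
	// 480 RSUs total: 120 vest at the 1 year cliff, then 30 more each quarter
	// for the 3 years after that.
	int vested = 120;
	printf("Month 12: %d shares vested\n", vested);
	for (int quarter = 1; quarter <= 12; ++quarter)
	{
		vested += 30;
		printf("Month %d: %d shares vested\n", 12 + quarter * 3, vested);
	}
	return 0;
}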

When the stock vests, it becomes yours, but the value of the stock becomes taxable income for that year. When you sell the shares, if the value has gone up, you pay short term capital gains tax if you’ve held the stock less than a year, or long term if you’ve held it more than a year. Same as ESPPs.

More information about RSUs: https://www.investopedia.com/terms/r/restricted-stock-unit.asp

Why Would a Company Offer These Benefits?

Let’s look at some motivations for these benefits. Reality is complicated as usual, with a mix of both optimistic and cynical truths.

ESPP

An optimistic viewpoint of ESPPs is that it’s a way to invest in the company because you believe it’s going to do well. The company thinks so too and is helping you out by giving these shares to you at a discount. How awesome!

A more cynical viewpoint is that any money a company pays an employee that goes into the ESPP program never actually left the company. It can effectively reduce payroll costs, turning some of those guaranteed payroll funds into non-guaranteed shareholder returns, where the employee takes on the risk. If the company does well, the employees with ESPP will cash out some, but there will be extra money around to fund that. If the company does poorly, the employees with ESPP will lose, getting their money back at a reduced rate, which is great for the company: it effectively pays less payroll when things are going poorly.

RSUs

An optimistic viewpoint here is that a company is going to do well, so giving you stock is going to benefit you in the future when the stock goes up.

A reality of the vesting schedule is that you are less likely to leave if you have a significant amount of stock that hasn’t vested yet. It helps retention.

A more cynical viewpoint is that RSUs are better for the company than a cash bonus, because the money is still invested in the company (they didn’t actually have to pay anything out). Also, you are given shares, not dollar amounts, so the risk of how much that is actually worth goes to you, the employee.

Bottom Line

The bottom line is that giving stock is a way of not actually giving anything in the short term, and putting any longer term risk on you, the employee. It’s a pretty clever business move frankly. Despite this, these programs are still beneficial to you – you will get that money eventually!

How Can I Take Most Advantage of These Benefits?

So, companies definitely have an angle on these benefits and are certainly going to do everything in their power to get the most out of them from their side. They are likely to offer educational material that lists the benefits that help them, but not say the things I’m about to say below. So how can you make the most of it for yourself?

ESPPs

If you believe the stock is going to go up, and you want the stock, go ahead and buy the stock! Max out the amount you contribute if you can afford it and sell when you are ready.

If you don’t believe the stock is going to go up, another way of thinking about ESPPs is that they are allowing you to buy money at a discount. If your ESPP program gives you a 15% discount on the stock, that means you only pay 85 cents for every dollar of stock you buy. This stock is paid for with money you are already paying income tax on, so buying it doesn’t do anything special for you as far as taxes are concerned. You are welcome to then IMMEDIATELY sell the stock, right after purchase, and get a dollar back for every 85 cents you paid. It’s true that you then have to take the 15 cents on the dollar as income and pay income tax on it, but that is the same as if you got a raise. If you immediately sell your ESPP stock after buying it, you are essentially giving yourself a raise. Why wouldn’t you do that? And why wouldn’t you max out how much you were contributing to ESPP to max out your raise?

Something to watch out for though is that if the stock is sold to you a lot lower because of the “look back” window, the difference between what you bought it for and what you sold it for becomes “capital gains”, with the short term vs long term implications I explained above. It might be that you prefer to hold onto that stock for a year before selling, to pay a smaller amount of tax on it. If this issue comes up, it’s a good problem to have though. The problem is basically “What do I do with all this money?!”

In short, if you don’t want the stock, and the stock is flat, or decreasing in value, you can max out your ESPP, and sell immediately after the buying event, to give yourself a raise.

RSUs

Again, if you believe the stock is going to go up, and you want the stock, go ahead and keep these RSUs.

If you don’t believe the stock is going to go up, what should you do?

When your RSUs vest, you pay income tax on the value they are at the time of vesting. This is equivalent to the company paying you more, and you go out and buy stock with that extra pay. If this wasn’t your plan for the money, sell it right when it vests! Convert it back into cash. The stock isn’t going to have changed price (except normal market fluctuations), so you aren’t going to have any gains. Take the cash and walk away, as if the stock purchase never happened. This is fine and doesn’t affect your taxes at all.

Furthermore, let’s say you are working at a company where you either don’t have any more RSUs waiting to vest, or you only have a small amount remaining. If you are in a position where you were given RSUs, you can likely go to a competitor, and get more RSUs from them as a signing bonus. Even if you have the same title, job function, and pay, go get some RSUs from the competition and look out for #1 (yourself). When a company is interested in retaining employees, they will offer them more RSUs to keep this vesting train going. If your company is not doing that for you, they are either not very smart, or they are not interested in keeping you. Either way, it might be a good time to jump elsewhere 🙂

Watching The Stock Market

When you have a lot of money in the stock market, it can be nerve wracking. Having a decent fraction of your pay in the stock market adds up to a lot of money, in my opinion. You might feel like you need to be constantly watching the market to protect your money, in case there is a big spike up or a big spike down.

A nice way to help with this is “stop orders” and “limit orders”, which can automate some simple tasks so you don’t have to be constantly glued to stock tickers. There are also more complicated things such as “stop limit” orders.

For instance, you can say “Sell this much stock if the price drops below 400$ a share” with a stop order. Sometimes stock can be volatile, and it could drop down to 200$ a share for a moment, then jump back up to 450$ a share. With a stop order, your stock might sell during that moment when it dropped down to 200, and you get sort of screwed, selling your stock too cheaply and losing money. A stop limit order lets you set a minimum price so you can say “Sell this much stock if the price drops below 400$ a share, but don’t sell if it goes below 350$ a share”.

With something like that in place, you can happily go off and have lunch without having to check your phone every few minutes just in case LOL.

More info about these things: https://www.schwab.com/learn/story/3-order-types-market-limit-and-stop-orders and https://www.investopedia.com/terms/s/stop-limitorder.asp

Conclusion

ESPP and RSUs can be nice perks when working for companies, but it’s important to know how these mechanisms work, to help make the best decisions for yourself and your own situation.

If you want the stock, by all means, keep it! Otherwise…

  • ESPPs can be a way to give yourself a raise. You deserve it, give yourself that raise!
  • RSUs are as if you were given extra pay, but the company bought stock with it before they gave it to you. Sell that stock if you weren’t going to buy stock with that money anyways, and take that cash, for the same tax implications.

Anything you think I should add to this, hit me up on mastodon at https://mastodon.gamedev.place/@demofox

Gradient Descent With Adam in Plain C++

The code that made the data and diagrams for this post is plain C++ and can be found at https://github.com/Atrix256/Adam

I recently published a set of 3 articles on machine learning in plain C++ for game developers, as well as a video version of the same content. This article explains an improvement to plain gradient descent called Adam and can be read as a follow up to those articles.

Part 1: https://www.ea.com/seed/news/machine-learning-game-devs-part-1
Part 2: https://www.ea.com/seed/news/machine-learning-game-devs-part-2
Part 3: https://www.ea.com/seed/news/machine-learning-game-devs-part-3

45 minute video: https://www.youtube.com/watch?v=sTAqWRsEiy0

Adam Motivation

Let’s say we wanted to find the values of the x parameters in the function below that gives the lowest y value:

y=f(x_1,x_2,x_3,...,x_n)

This is an optimization problem and functions of this form are called cost functions. Gradient descent is a method for doing this optimization, where you start with a random set of x values, and “roll down hill” by making small adjustments to those x values that lead to lower y values. You can stop either after a specific number of steps, or when the y value is below some threshold.

The gradient of a function is a vector of the partial derivatives of the x parameters, and the vector as a whole points in the steepest direction up hill on the surface of the function. If you want to go the steepest direction down hill, you just go in the opposite direction.

A challenge is that the steepest path down hill is usually a curved path, not a straight line. To deal with this, you take small steps down hill and re-evaluate the gradient at each step to figure out the direction to step in next. In machine learning, this step size is called the learning rate, is often denoted α (“alpha”), and is tuned by hand.
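As a concrete sketch of that loop, here is plain gradient descent in C++ on a made-up one parameter function y = x*x (the starting point, learning rate, and step count are placeholder values):

#include <cstdio>

int main()
{
	float x = 5.0f;              // starting point
	const float c_alpha = 0.1f;  // the learning rate, tuned by hand

	for (int step = 0; step < 100; ++step)
	{
		float derivative = 2.0f * x;  // dy/dx of y = x * x
		x -= c_alpha * derivative;    // take a small step down hill
	}

	printf("x = %f\n", x);  // ends up near the minimum at x = 0
	return 0;
}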

Smaller steps give you a more accurate result, but may be computationally wasteful, taking many small steps when you could take fewer, larger steps towards the goal. This makes for longer than necessary computation (or training) times.

Larger steps might be able to get you nearer to the goal faster, but if the minimum value of the function lies in a deep, narrow canyon, large steps won’t be able to get you into the deepest part of the function.

Having to adjust this learning rate by hand, and having the appropriate learning rate change as gradient descent progresses, is one of the two problems Adam addresses: it turns down the learning rate when the terrain gets more complicated / bumpy.

Another challenge with gradient descent is that it’s possible to get stuck in a shallow valley when there is a much deeper canyon waiting to be found. The shallow valley is called a local minimum, because while it’s the deepest point locally, it isn’t the globally deepest part of the function, or the global minimum.

A common way of dealing with this is to add momentum to our imaginary ball rolling down hill, so that it can escape local minima and hopefully find deeper, more global minima. If not, it should just roll back into the minimum it did find. Momentum is the second benefit Adam gives over plain gradient descent.

Adam is a popular, and simple, method of improving gradient descent and is heavily used in machine learning, which uses gradient descent to train neural networks.

Adam

Adam has 4 parameters, but only alpha usually needs to be tuned.

  • Alpha – This is essentially the learning rate.
  • Beta1 – The rate of exponential moving average of the derivative (First Moment). Default is 0.9.
  • Beta2 – The same for the derivative squared (Second Moment). Default is 0.999.
  • Epsilon – a value to prevent division by zero. Default is 1e-8.

Adam works by keeping an exponential moving average of the derivative (the first moment of the derivative), as well as a moving average of the derivative squared (the second moment of the derivative), each with their own smoothing factor. In this way, this gives Adam an imperfect memory of the derivatives it has seen, tending to forget older derivative values, and weight newer derivative values more highly. The squared derivative moving average is also more sensitive to large outliers.

The code for doing this is straightforward:

		static const float c_beta1 = 0.9f;
		static const float c_beta2 = 0.999f;

		m = c_beta1 * m + (1.0f - c_beta1) * derivative;
		v = c_beta2 * v + (1.0f - c_beta2) * derivative * derivative;

Interestingly, the m calculation is the same as doing a lerp from m to derivative with a t value of 0.1. If you work in real time rendering, that should look familiar, as that is how TAA (Temporal Anti Aliasing) works as well, also often using a value of 0.1. v works similarly, but uses a t value in the lerp of 0.001.
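In other words, with a small lerp helper (defined here just for illustration), those two updates could equivalently be written as:

		inline float lerp(float a, float b, float t) { return a + (b - a) * t; }

		m = lerp(m, derivative, 1.0f - c_beta1);               // t = 0.1
		v = lerp(v, derivative * derivative, 1.0f - c_beta2);  // t = 0.001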

So, m and v are a moving average of recent derivative and derivative squared values respectively.

The learning rate alpha is then multiplied by m and divided by the square root of v, to calculate an adjusted derivative to use for gradient descent.

		float adjustedDerivative = alpha * m / sqrt(v);

If the derivative were a constant, m would be that constant value, and sqrt(v) would be its absolute value, which means the parameter would be adjusted by alpha, in the direction of the derivative. If |m| is bigger than sqrt(v), it increases the effective learning rate, and if |m| is smaller than sqrt(v), it decreases it. Through this simple operation, Adam gives both momentum to escape local minima, and the ability to slow down when there are more details to explore on the surface of a function.

We aren’t quite done though. If you work in graphics, you likely have noticed that when using TAA, the first several frames rendered are darker than they should be, before enough frames have been processed. This same problem occurs in Adam, meaning that when it’s first starting out, both m and v are biased towards zero. Adam does bias correction to account for this, essentially making the numbers larger (or for TAA, it’d make the pixels brighter). It calculates an “m hat” and a “v hat”, and that is what is used to calculate the learning rate instead of the raw m and v. Since v hat could possibly be zero, we have to account for that in the division.

		static const float c_epsilon = 1e-8f;

		float mhat = m / (1.0f - std::pow(c_beta1, sampleCount));
		float vhat = v / (1.0f - std::pow(c_beta2, sampleCount));

		float adjustedDerivative = alpha * mhat / (sqrt(vhat)+c_epsilon);

Putting it all together, here is a small struct you can use to put a derivative through Adam to get an adjusted derivative. You would have one of these per parameter to your function, so that Adam is able to adjust the learning rate of each parameter individually.

#include <cmath> // for std::sqrt

// Note: alpha needs to be tuned, but beta1, beta2, epsilon are usually fine as is
struct Adam
{
	// Internal state
	float m = 0.0f;
	float v = 0.0f;

	// Internal state calculated for convenience
	// If you have a bunch of derivatives, you would probably want to store / calculate these once
	// for the entire gradient, instead of each derivative like this is doing.
	float beta1Decayed = 1.0f;
	float beta2Decayed = 1.0f;

	float GetAdjustedDerivative(float derivative, float alpha)
	{
		// Adam parameters
		static const float c_beta1 = 0.9f;
		static const float c_beta2 = 0.999f;
		static const float c_epsilon = 1e-8f;

		// exponential moving average of first and second moment
		m = c_beta1 * m + (1.0f - c_beta1) * derivative;
		v = c_beta2 * v + (1.0f - c_beta2) * derivative * derivative;

		// bias correction
		beta1Decayed *= c_beta1;
		beta2Decayed *= c_beta2;
		float mhat = m / (1.0f - beta1Decayed);
		float vhat = v / (1.0f - beta2Decayed);

		// Adam adjusted derivative
		return alpha * mhat / (std::sqrt(vhat) + c_epsilon);
	}
};
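As a usage sketch, here is how the struct might drive gradient descent, again on a made-up y = x*x function (alpha and the step count are placeholders):

#include <cstdio>

int main()
{
	Adam adam;  // the struct above, one per parameter being optimized
	float x = 5.0f;
	const float c_alpha = 0.1f;

	for (int step = 0; step < 1000; ++step)
	{
		float derivative = 2.0f * x;  // dy/dx of y = x * x
		x -= adam.GetAdjustedDerivative(derivative, c_alpha);
	}

	printf("x = %f\n", x);  // walks towards the minimum at x = 0
	return 0;
}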

An interesting aspect of Adam is that it adjusts the step size (alpha) by saying “be bigger” or “be smaller”, but it never gives a specific value that is an optimal step size, and you still have to tune the learning rate by hand. To me, that looks like something we’ll likely have a better answer to in the future.

I also find it interesting that Adam calculates an adjusted value and uses it as the step directly, instead of multiplying the raw derivative by a learning rate. That makes it so that while the length of a function’s gradient may change from location to location, the actual step size is more fixed, and is affected only by how the gradient changes over time.

Another thing people do with gradient descent is start with a larger learning rate and decrease it over time. This helps the gradient descent move quickly in the beginning to get closer to the global minimum, and then as it slows down, it helps it refine the search. Interestingly, some people do this ALONGSIDE Adam, which to me makes some sense. Since Adam can only slow down or speed up a learning rate, but not set it to a specific value, it seems like it could be a good idea to decrease the learning rate over time as well. But then you have another hyperparameter to tune: how much should alpha decrease each iteration?

Results

As a test, I’ve compared Adam to plain gradient descent, on finding the minimum height location of a 2D terrain made of randomized Gaussians. The code that made these diagrams and data can be found at https://github.com/Atrix256/Adam

The difference between the two methods is not as big as I’d hoped, but I think Adam does better in higher dimensions (more than the 2 parameters of x and y), where the per parameter (per dimension) adjustable training weight makes Adam more maneuverable. In this case of only 2 dimensions, I’m not all that impressed. Here’s a relevant quote from Andrej Karpathy (https://karpathy.github.io/2019/04/25/recipe/) though:

Below are graphs to show how gradient descent (GD) with different learning rates compares to Adam with different alpha parameters, and each to themselves. Also below are animated gifs showing several points of each type searching for the minimum. Gradient descent is on the left, with Adam on the right. The colored bars in the center show the relative average heights of each particle type.

In the animated images, I used only 16 greys to show the Gaussians, to make it easier to read as a topo map, but the actual heights being optimized against are continuous.

Seed = 3684993765

Seed = 181763002

Seed = 1730885985

Links

The paper that introduced Adam is surprisingly easy to read. It’s called “Adam: A Method for Stochastic Optimization” https://arxiv.org/abs/1412.6980

Here are some other resources relating to Adam, and other gradient descent methods.

Making Blue Noise Point Sets With Sliced Optimal Transport

This post explains how to use sliced optimal transport to make blue noise point sets. The plain, well commented C++ code that goes along with this post, which made the point sets and diagrams, is at https://github.com/Atrix256/SOTPointSets.

This is an implementation and investigation of “Sliced Optimal Transport Sampling” by Paulin et al (http://www.geometry.caltech.edu/pubs/PBCIW+20.pdf). They also have code to go with their paper at https://github.com/loispaulin/Sliced-Optimal-Transport-Sampling.

If you missed the previous post that explained sliced optimal transport, give it a read: https://blog.demofox.org/2023/11/25/interpolating-color-image-histograms-using-sliced-optimal-transport/

Generating Points in a Disk

Let’s say we wanted a set of points randomly placed in 2D, but with even density – aka we wanted 2D blue noise points in a square.

We could start by generating 1000 white noise points, then using sliced optimal transport to make them evenly spaced:

  1. Project all points onto a random unit vector. Calculate how much to move each point to make them be evenly spaced on that 1D line, in the same order.
  2. Calculate the average of that process done 64 times, then move the points that amount.
  3. Repeat this process 1000 times.
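Here is a minimal sketch of steps 1 and 2 in C++ (step 3 just repeats the batch), with the evenly spaced 1D targets in [-0.5, 0.5]; the function and variable names are made up for illustration:

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

struct float2 { float x, y; };

// One batch: average the movement suggested by batchSize random 1D projections,
// then move the points by that average.
void SOTBatch(std::vector<float2>& points, int batchSize, std::mt19937& rng)
{
	std::normal_distribution<float> normalDist(0.0f, 1.0f);
	std::vector<float2> movement(points.size(), { 0.0f, 0.0f });
	std::vector<int> order(points.size());

	for (int batch = 0; batch < batchSize; ++batch)
	{
		// a random unit vector to project onto
		float2 dir = { normalDist(rng), normalDist(rng) };
		float len = std::sqrt(dir.x * dir.x + dir.y * dir.y);
		dir.x /= len; dir.y /= len;

		// sort the point indices by their projection onto that vector
		for (int i = 0; i < (int)order.size(); ++i)
			order[i] = i;
		std::sort(order.begin(), order.end(),
			[&](int a, int b) {
				return points[a].x * dir.x + points[a].y * dir.y <
				       points[b].x * dir.x + points[b].y * dir.y;
			});

		// how far does each point need to move along the line to reach its
		// evenly spaced position? (Even spacing is what makes the points end
		// up in a disk, as discussed below.)
		for (int i = 0; i < (int)order.size(); ++i)
		{
			float proj = points[order[i]].x * dir.x + points[order[i]].y * dir.y;
			float target = (float(i) + 0.5f) / float(order.size()) - 0.5f;
			float delta = target - proj;
			movement[order[i]].x += delta * dir.x;
			movement[order[i]].y += delta * dir.y;
		}
	}

	// apply the average movement from the whole batch
	for (size_t i = 0; i < points.size(); ++i)
	{
		points[i].x += movement[i].x / float(batchSize);
		points[i].y += movement[i].y / float(batchSize);
	}
}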

Doing that, we start with the points on the left and end up with the points on the right.

That looks like good blue noise and only took 0.56 seconds to generate, but confusingly, the points are on a disk and are denser near the edges. We’ll make the density uniform in this section, and will make the points be in a square in the next section.

The reason we are getting a disk is because we are projecting the points onto random directions and making the points be on that line, up to a maximum distance away from the origin. That is forcing them to be inside of a disk.

The reason the points are denser near the edges is because we are making the points evenly spaced on the 1D lines, but there is less room vertically on the disk near the edges. That means the points have less space to occupy at the edges, causing them to bunch up.

To account for this, we need to make there be more points in the center, and fewer at the edges. More specifically, we need each point to claim an even amount of disk area, not an even amount of line width.

Below shows a circle divided into evenly spaced slices on the x axis (top), and slices of equal area (bottom). (diagram made at https://www.geogebra.org/geometry)

We can do this in a couple of steps:

  1. Get the formula for the height of the disk at a given x position. We’ll call this y=f(x).
  2. Integrate that function to get a y=g(x) function that tells you how much area the disk has at, and to the left of, a given x position.
  3. Divide g(x) by the area of the disk to get a function h(x) that gives a value between 0 and 1.
  4. Invert h(x) by changing it from y=h(x) to x=h(y) and solving for y. That gives us a function y=i(x).
  5. Plug the evenly spaced values we were using on the line into the function i, to get positions on the line that give us equal values of disk area.

This might look more familiar from a statistics point of view. f(x) is an unnormalized PDF, g(x) is an unnormalized CDF, h(x) is a CDF (normalized), and i(x) is an inverse CDF.

If you want a refresher on inverting a CDF, give this a read: https://blog.demofox.org/2017/08/05/generating-random-numbers-from-a-specific-distribution-by-inverting-the-cdf/

If we have a disk centered at (0,0) with radius 0.5, the height of the disk at location x can be found using the Pythagorean theorem. We know x, and the hypotenuse is the radius 0.5, so y=\sqrt{0.25-x^2}. That only gives us the top half of the disk height, but we can double it to get the bottom half as well: f(x)=2*\sqrt{0.25-x^2}.

We can take the integral of that to get g(x)=x \sqrt{0.25-x^2} + 0.25 \sin^{-1}(2x). Dividing that by the area of the disk (\frac{\pi}{4}), we get h(x)=(x \sqrt{0.25-x^2} + 0.25 \sin^{-1}(2x)) * \frac{4}{\pi}

Instead of inverting that function, I made a table with 1000 (x,y) pairs of x and y=h(x). In the table, x is between -0.5 and +0.5, but I also made y be between -0.5 and +0.5. That is a bit different than a CDF where y goes from 0 to 1, but it’s just a convention I’m choosing; a standard CDF would work as well. To evaluate i(x), I find the location where the given value shows up in the table as a y value, and give back the x that gave that y value, using linear interpolation.
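Here is a sketch of how such a table can be built and searched; the function names are made up, and the real code in the repo differs in the details:

#include <algorithm>
#include <cmath>
#include <vector>

struct TableEntry { float x, y; };

// Build a table of y = h(x): the normalized disk area at and to the left of x,
// with both x and y going from -0.5 to +0.5.
std::vector<TableEntry> MakeDiskCDFTable(int numEntries)
{
	const float c_pi = 3.14159265359f;
	std::vector<TableEntry> table(numEntries);
	for (int i = 0; i < numEntries; ++i)
	{
		float x = float(i) / float(numEntries - 1) - 0.5f;
		// g(x), the antiderivative of f(x) = 2*sqrt(0.25-x^2), which is 0 at x=0
		float g = x * std::sqrt(std::max(0.0f, 0.25f - x * x))
		        + 0.25f * std::asin(std::max(-1.0f, std::min(1.0f, 2.0f * x)));
		// dividing by the disk area pi/4 gives the version centered on 0
		table[i] = { x, g * 4.0f / c_pi };
	}
	return table;
}

// Evaluate i(u): find where u shows up in the table as a y value and return
// the x that produced it, using linear interpolation between table entries.
float EvaluateInverseCDF(const std::vector<TableEntry>& table, float u)
{
	for (size_t i = 1; i < table.size(); ++i)
	{
		if (table[i].y >= u)
		{
			float t = (u - table[i - 1].y) / (table[i].y - table[i - 1].y);
			return table[i - 1].x + t * (table[i].x - table[i - 1].x);
		}
	}
	return table.back().x;
}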

The sliced optimal transport sampling paper also used a numerical inversion of the CDF, but they did it differently, involving gradient descent. I am happy with the results I got, so I think the table is good enough.

Doing that, we get blue noise points nicely distributed on a disk, and it only took 0.73 seconds to generate.

Blue noise points on a disk are useful for a couple things. If you want to select points to sample on a disk light source (such as the sun, perhaps), you can use blue noise points to get good coverage over the sampling domain, and without aliasing problems from a low sample count. You can also take the (x,y) values of these 2D points and add a Z component with the positive value that makes it a normalized vector (z = \sqrt{1-x^2-y^2}) and that gives you blue noise distributed cosine weighted hemispherical samples, useful for importance sampling the \cos{\theta} term in lighting calculations.
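As a tiny sketch of that last mapping (this assumes the disk points have first been scaled up to the unit radius disk):

#include <algorithm>
#include <cmath>

struct float3 { float x, y, z; };

// Lift a 2D point on the unit disk up onto the hemisphere, giving a cosine
// weighted hemispherical direction.
float3 DiskToCosineHemisphere(float x, float y)
{
	return { x, y, std::sqrt(std::max(0.0f, 1.0f - x * x - y * y)) };
}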

Fun fact: this projection of a density function (PDF) onto a line is actually a partial Radon transform, which comes from tomography, and relates to how we are able to make images from X-rays.

Generating Points in a Square

While points in a disk are useful, we started by trying to make points in a square. To do that, we’ll need to project the area of the square down onto each 1D line, like we did the circle, and use that area to drive where the points go, to make each point have equal area of the square. This is more challenging than the circle case, because a circle is rotationally symmetric, but a square isn’t, so the function to give us evenly spaced points also has to take the projection direction into account.

We ultimately want to project the area of a rotated square onto a 1d line, and use that to give an equal amount of area to each point we distribute along that line.

My mathematically talented friend William Donnelly came up with this function (which is in squaredcdf.h in the repo). The value u should be between -0.5 and +0.5, and will return the x value of where the point should go on the line defined by direction d. The square has width of 1, and is centered at the origin, so the unrotated square corners are at (+/- 0.5, +/- 0.5). A sketch of the derivation can be found at https://blog.demofox.org/2023/12/24/deriving-the-inverse-cdf-of-a-rotated-square-projected-onto-a-line/.

inline float InverseCDF(float u, float2 d)
{
	float c = std::max(std::abs(d.x), std::abs(d.y));
	float s = std::min(std::abs(d.x), std::abs(d.y));

	if (2 * c * std::abs(u) < c - s)
	{
		return c * u;
	}
	else
	{
		float t = 0.5f * (c + s) - sqrtf(2.0f * s * c * (0.5f - std::abs(u)));
		return (u < 0) ? -t : t;
	}
}

If we use that to space the points along each line, again doing a batch size of 64, and doing 1000 batches, we end up with this, which took 0.89 seconds to generate.

The first image is the starting state, the 2nd and 3rd are the ending states, and last is the DFT to show the characteristic looking blue noise spectrum with a dark circle in the center of suppressed low frequencies.

Regarding the third image, if you think through what’s happening with random projections, and distributing points according to those 1D projections, there is nothing explicitly keeping the points in the square. It’s surprising to me they stay in there so well. The third image shows how well they are constrained to that square.

The blue noise points don’t look that great though. There are some bent line type patterns near the edges, and especially the corners of the square. Unfortunately, these points seem to have reached a point of “overconvergence” where they start losing their randomization. You can see the same thing in the sliced optimal transport paper. One solution to this problem is to just not do as many iterations. Below is what the point set looks like after one tenth of the iterations, or 100 batches. The DFT shows stronger lower frequencies, but the visible patterns in the point sets are gone. As we’ll see further down, adjusting the batch size may also help.

Comparison To Other Blue Noise Point Sets

Below are shown three types of blue noise point sets, each having 1000 points. Points are on the left, and DFT is on the right to show frequency characteristics.

  1. Dart Throwing – Points are placed in a square going from [0,1]. Each point is generated using white noise and is accepted if the distance to any existing point is greater than 0.022. I had to tune that distance threshold by hand to make it as large as possible, but not so large that it couldn’t be solved for 1000 points. Wrap around / toroidal distance is used: (https://blog.demofox.org/2017/10/01/calculating-the-distance-between-points-in-wrap-around-toroidal-space/)
  2. Sliced Optimal Transport aka SOT – The method in this post, using a batch size of 64, and 100 iterations.
  3. Mitchell’s Best Candidate – For the Nth blue noise point, you generate N+1 candidate points using white noise. Keep whichever point is farthest from the closest existing point. Wrap around / toroidal distance is used. (https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/)

I’d say dart throwing is the worst, with a less pronounced dark ring in the frequency domain. For the best, I think it’s a toss up between SOT and MBC. MBC suppresses more of the low frequencies, but SOT more strongly suppresses the frequencies it does suppress. We saw earlier that doing more iterations can increase the size of the dark circle in the DFT, but that it makes the point set too ordered. It may be situational which of these you want. There is a big difference between using blue noise point sets for organic looking object placement in a game, and using blue noise point sets for sparse sampling meant to then be filtered by a matching Gaussian filter. The first is an artistic choice and the second is a mathematical one.

There are other methods for generating blue noise point sets. “Gaussian Blue Noise” by Ahmed et al. is the state of the art, I believe: https://arxiv.org/abs/2206.07798

Algorithm Parameters

A batch size of 64 is used in generating all the point sets so far. Here’s what other batch sizes look like, at 1000 iterations.

It looks like increasing the batch size may also be a way to get rid of the regularity we saw before at a batch size of 64, after 1000 iterations. More investigation is needed here, but being able to run an algorithm to completion, instead of having to stop it early at some unknown point, is a much better way to run an algorithm.

Here is a log/log convergence graph that shows average pixel movement each batch. The x axis is the total number of random lines projected, which is constant across all batch sizes. The number of iterations is increased for smaller batch sizes and decreased for larger batch sizes to give an “equal computation done” comparison. This isn’t necessarily a graph of quality though, it just shows each batch size reaching the final state, whatever quality that may be. More investigation is needed.

We’ve been using white noise for generating the random directions to project onto. What if we instead used the golden ratio low discrepancy sequence to generate pseudorandom rotations that were more evenly spaced over the circle? Below is a graph that shows that for both square and circle, with 1000 iterations at a batch size of 64. The golden ratio variants move slightly less, and move less erratically, but the difference seems pretty small.

I can’t tell much of a difference in the disk points:

And the square points look real similar too. The DFT does seem smoother for the golden ratio points though, and the light circle around the dark center seems to be less lumpy. This might be worth looking into more later on, but it seems like a minor improvement.

I think it could also be interesting to try starting with a stratified point set, or a low discrepancy point set, instead of white noise, before running the SOT algorithm. That might help find a better minimum, or find it faster, by starting with a point set that was already pretty well evenly spaced. Perhaps more generally, this algorithm could start with any other blue noise point set generation method and refine the results, assuming that the points created were lower quality than this algorithm can provide.

Using Density Maps

Generating these sample points using sliced optimal transport involves projecting the area of where we want our points to be down onto a 1D line, and using that projected area as a PDF for where points should go to get equal amounts of that projected area.

What if, instead of projecting down a shape, which projects a boolean “in shape” or “not in shape” onto the line, we instead used a greyscale image as a density map and projected the pixel values down onto the line to make a PDF?

I did that, projecting each pixel of a 512×512 image down onto 1000 buckets along the line, adding the pixel value into each bucket it touches, weighted by how much of the bucket the pixel covers. The pixel values are also inverted so that bright pixels are places where dots should be less dense, because I’m doing black dots on a white background.
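Here is a simplified sketch of that projection; unlike the description above, it splats each pixel’s center into a single bucket instead of spreading the value across every bucket the pixel touches, and the names are made up:

#include <algorithm>
#include <cmath>
#include <vector>

// Project a greyscale image (pixel values in [0,1]) onto 1D buckets along a
// unit direction (dirX, dirY), treating the image as covering [-0.5, 0.5]^2.
std::vector<float> ProjectDensity(const std::vector<float>& pixels, int width, int height,
	float dirX, float dirY, int numBuckets)
{
	std::vector<float> buckets(numBuckets, 0.0f);

	// pixel centers project into roughly [-maxProj, +maxProj]
	float maxProj = 0.5f * (std::abs(dirX) + std::abs(dirY));

	for (int y = 0; y < height; ++y)
	{
		for (int x = 0; x < width; ++x)
		{
			float px = (float(x) + 0.5f) / float(width) - 0.5f;
			float py = (float(y) + 0.5f) / float(height) - 0.5f;
			float proj = px * dirX + py * dirY;

			// invert so that dark pixels want a higher density of (black) dots
			float density = 1.0f - pixels[y * width + x];

			int bucket = int((proj + maxProj) / (2.0f * maxProj) * float(numBuckets));
			bucket = std::min(std::max(bucket, 0), numBuckets - 1);
			buckets[bucket] += density;
		}
	}
	return buckets;
}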

Below right is the image used as the density map. Below left is the starting point and below middle is the final result, using 20,000 points, 1,000 iterations and a batch size of 64. It took 50 seconds to generate, and I think it looks pretty nice! I’m amazed that doing random 1d projections of an image onto a line results in such a clean image with such sharp details.

The reason this takes so much longer to generate than the other point sets seen so far is the looping through all 512×512=262,144 pixels and projecting them down onto the 1d line for each random projection vector. I’ve heard that there are ways to make this faster by working in frequency space, but I don’t know the details of that.

In game development, perhaps this could be used to place trees in a forest, where a greyscale density texture controlled the density of the trees on the map.

I haven’t ever seen it before, but you could probably use a density map with both dart throwing and Mitchell’s best candidate as well. Both of those algorithms calculate distance between points. You could get the density value for each point, convert that density to a radius value, and subtract the two radius values from the distance between the points. It would be interesting to compare their quality vs these results.

Generating Multiclass Samples

Multiclass blue noise point sets are blue noise point sets where each point belongs to a class. Each class of points should independently be blue noise, but combinations of classes should also be blue noise.

The “Sliced Optimal Transport Sampling” paper that this post is exploring has support for this. Every other iteration, after projecting the points onto the line and sorting them, they then make sure the classes are interleaved on that line. They show it for 2 and 3 classes with equal point counts.

If using this for placing objects on a map, you might have one class for trees, another class for bushes, and another class for boulders. However, you might not want an equal number of these objects, or equivalently, you may want more space between trees, than you want space between bushes. To do that, you’d need to have different point counts per class.

Luckily that ends up being pretty easy to do. Let’s say we have three classes with weights 1, 4 and 16. Those sum up to 21. When generating your random points, you can use the point index to calculate which class it is (classIndex instead of class, since class is a reserved word in C++):

int classIndex = 2;
if (index % 21 == 0)
  classIndex = 0;
else if (index % 21 < 5)
  classIndex = 1;

Then, when doing the “interleave” step that is done every other iteration, after sorting the points, we make sure that there is a class 0, then four class 1s, then sixteen class 2s, repeating that pattern over and over, as sketched below.
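Here is a sketch of that interleave step for the 1, 4, 16 weights; the names are made up, and it assumes the per class point counts match the weights exactly:

#include <vector>

// After sorting each class's points by their projection onto the line,
// hand out the evenly spaced 1D target positions so the classes repeat in
// the pattern: one class 0, four class 1s, sixteen class 2s.
// sortedByClass[c] holds the indices of class c's points, sorted by projection.
// targetPositions is indexed by point index.
void InterleaveClasses(const std::vector<std::vector<int>>& sortedByClass,
	int totalPoints, std::vector<float>& targetPositions)
{
	std::vector<size_t> nextOfClass(sortedByClass.size(), 0);
	for (int slot = 0; slot < totalPoints; ++slot)
	{
		// which class owns this slot, from the repeating weight pattern
		int p = slot % 21;
		int classIndex = (p == 0) ? 0 : ((p < 5) ? 1 : 2);

		// the next point of that class gets this slot's evenly spaced position
		int pointIndex = sortedByClass[classIndex][nextOfClass[classIndex]++];
		targetPositions[pointIndex] = (float(slot) + 0.5f) / float(totalPoints) - 0.5f;
	}
}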

Doing that in a square, with 1000 points, 1000 iterations, and a batch size of 64 gives us this, which took 0.57 seconds to generate. Click the image to see it full sized in another window. The top row shows the classes as colors RGB. The middle shows the combined point set, and bottom shows the DFT.

This works just fine with density maps too, generating this in 49 seconds:

Optimal Transport

Generating multi class and uneven density blue noise point sets works well using sliced optimal transport, but where is the actual optimal transport of this, and what exactly is being transported?

Going back to the last post’s explanation of optimal transport being about finding which bakeries should deliver to which cafes to minimize the total distance traveled, the initial point sets are the bakeries, and instead of there being discrete cafes, the PDF / density maps define a fuzzy “general desire for baked goods in an area”.

The optimal transport here is finding where to deliver baked goods to most evenly match demand for baked goods, and doing so, such that the sum of the distance traveled by all baked goods is minimized.

When we run the sliced optimal transport algorithm, we do eventually end up with the points being in the “optimal transport” position at the end, but the path that the points traveled to get there is not the optimal transport solution.

The optimal transport solution is the line from the initial point locations to their ending locations. That optimal transport solution will be a straight line, while the path the points took in SEARCH of the optimal transport solution may not be straight lines.

Below is the evolution of 20,000 points as they search for the optimal transport solution, over 100 iterations, with a batch size of 64. The points are not taking the path of optimal transport; they are searching for their final position in the optimal transport solution.

Below is an animation that interpolates from the starting point set, to the ending point set, over the same amount of time (100 frames). This is the points taking the path of the optimal transport solution!

Closing

The paper which introduced dart throwing was “Stochastic sampling in computer graphics”, Cook 1986. That paper explains how in human eyes, and in some animals, the fovea (the place where vision is the sharpest) places photoreceptors in a hexagonal packing. This is interesting because a hexagonal grid is more evenly distributed than a regular grid, which has diagonal distances about 41% longer (a factor of \sqrt{2}) than cardinal distances. Outside of the fovea, where the photoreceptors are sparser, a blue noise pattern is used, which is known to be roughly evenly spaced, but randomized, which avoids aliasing.

To me, this is nature giving us a hint for how to do sampling or rendering. For best quality and higher sample counts, a hex grid (or more generally, a uniform grid) is best, and for lower / sparser sample counts, blue noise is best.

That information is in section 3: https://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/cook86_sampling.pdf

I hope you enjoyed this. Next up will be a post looking more deeply at an interesting 2022 paper “Scalable multi-class sampling via filtered sliced optimal transport” (https://www.iliyan.com/publications/ScalableMultiClassSampling). That paper is what convinced me I needed to learn optimal transport, and is what led to this series of blog posts.

Deriving the Inverse CDF of a Rotated Square Projected onto a Line

This goes along with the blog post for generating blue noise point sets using sliced optimal transport at https://blog.demofox.org/2023/12/24/making-blue-noise-point-sets-with-sliced-optimal-transport/

The code implementing this can be found on github at https://github.com/Atrix256/SOTPointSets/blob/main/squarecdf.h

The diagrams in this post were made at https://www.geogebra.org/geometry

I tried deriving this a few times, but kept getting a PDF that was more complex than I’d like. It turned out my mathematically inclined friend, and frequent co-author William Donnelly (such as in https://www.ea.com/seed/news/constant-time-stateless-shuffling), had already worked this out previously, so he shared how he did it, which I explain below more as a sketch than a proof. Treat the code as the source of truth here 🙂

The PDF

Let’s start out with a square centered at (0,0) with a width and height of 1.

Now we rotate it. This is 25 degrees, but the specific angle doesn’t matter.


We want to find the height of the shaded region at a given point x. For now we’ll stick to the right half of the rotated square. In the diagram below, we are trying to find the length of the line segment EF. This is also the hypotenuse of the right triangle EFG.

The definition of cosine of theta is that it’s “adjacent over hypotenuse”, or the length of EG over the length of EF. We know that the length of EG is 1 because that is the height of the square, so we have:

\cos(\theta) = \frac{1}{|EF|}

We can re-arrange that to get:

|EF| = \frac{1}{\cos(\theta)}

Theta (\theta) is the angle that our square is rotated by.

That is our formula for the height of the rotated square at a point x, and it is a constant value from x=0 to the x value of P.

The value of P before rotation is (0.5, 0.5). The 2d rotation formula is:

p'_x=p_x\cos(\theta)-p_y\sin(\theta) \\p'_y=p_x\sin(\theta)+p_y\cos(\theta)

So, the x value of P after rotation is \frac{\cos(\theta)-\sin(\theta)}{2}. This is the point at which our height equation needs to change to a different formula.

While we are rotating points, we can also rotate the point Q (0.5, -0.5) and get an x value of \frac{\cos(\theta)+\sin(\theta)}{2}.

That x value of Q is where our projected height goes to 0, and at the x value of P, we know the projected height is \frac{1}{\cos(\theta)}. We also know that the height is a linear function in this region (it’s the top line minus the bottom line). So, to get the 2nd part of our height function, we are trying to find the formula of the line segment between P and Q below:

We know that the distance from P to Q on the x axis is the x axis point of Q minus the x axis point of P.

\frac{\cos(\theta)+\sin(\theta)}{2} -  \frac{\cos(\theta)-\sin(\theta)}{2} = \sin(\theta)

We know the height already because it’s the same as the length of AD, so the slope of the line from P to Q is rise over run, or:

\frac{1}{\cos(\theta)} \div \sin(\theta) = \frac{1}{\cos(\theta)\sin(\theta)}

Instead of making a formula from P and subtracting the slope, we can instead subtract x from the maximum value x can take, and multiply the slope in. We know that the height at Q is 0, so that is our formula:

(\frac{\cos(\theta)+\sin(\theta)}{2} - x) \cdot \frac{1}{\cos(\theta)\sin(\theta)}

We now have a piecewise height function, for the right side of the x axis, for a rotated square:

\text{Height}(x) = \begin{cases} \frac{1}{\cos(\theta)} & \text{if } x < \frac{\cos(\theta)-\sin(\theta)}{2} \\ (\frac{\cos(\theta)+\sin(\theta)}{2} - x) \cdot \frac{1}{\cos(\theta)\sin(\theta)} & \text{else if } x < \frac{\cos(\theta)+\sin(\theta)}{2} \end{cases}

The sines and cosines make the formula look complex, but if your function is passed the normalized vector that the points are being projected onto, the x component is the cosine of theta and the y component is the sine of theta. That is point F below.

Due to vertical and horizontal symmetry of the rotated square, we can take the absolute value of that vector’s components to bring it to the 1st quadrant. We can go further and set cosine to the maximum of x and y, and sine to the minimum of x and y, and bring the solution to 0 to 45 degrees.

float cosTheta = std::max(std::abs(direction.x), std::abs(direction.y));
float sinTheta = std::min(std::abs(direction.x), std::abs(direction.y));

Doing that, our height function handles all possible angles. Also, because the area of our square is 1 – it is centered at (0,0) with a width and height of 1 – our height function is a normalized PDF!

\text{PDF}(x) = \begin{cases} \frac{1}{\cos(\theta)} & \text{if } x < \frac{\cos(\theta)-\sin(\theta)}{2} \\ (\frac{\cos(\theta)+\sin(\theta)}{2} - x) \cdot \frac{1}{\cos(\theta)\sin(\theta)} & \text{else if } x < \frac{\cos(\theta)+\sin(\theta)}{2} \end{cases}

The CDF

Next we need to integrate the PDF into a CDF. The first section of the piecewise PDF is constant so when integrated will become linear. The second section is linear so will become quadratic.

\int \frac{1}{\cos(\theta)} \,dx = \frac{x}{\cos(\theta)}

\int (\frac{\cos(\theta)+\sin(\theta)}{2} - x) \cdot \frac{1}{\cos(\theta)\sin(\theta)} \,dx = -\frac{1}{2} (\frac{\cos(\theta)+\sin(\theta)}{2} - x)^2 \cdot \frac{1}{\cos(\theta)\sin(\theta)} + C

Note the minus sign from the chain rule. Choosing the constant of integration C = \frac{1}{2} makes the two pieces join at x = \frac{\cos(\theta)-\sin(\theta)}{2}.

That gives us:

\text{CDF}(x) = \begin{cases} \frac{x}{\cos(\theta)} & \text{if } x < \frac{\cos(\theta)-\sin(\theta)}{2} \\ \frac{1}{2} - \frac{1}{2} (\frac{\cos(\theta)+\sin(\theta)}{2} - x)^2 \cdot \frac{1}{\cos(\theta)\sin(\theta)} & \text{else if } x < \frac{\cos(\theta)+\sin(\theta)}{2} \end{cases}

Will has the CDF’s values centered at 0, so it goes from -0.5 to 0.5 instead of from 0 to 1, and for negative x he returns the negated version of the second case’s value, as you can see in the code below.

The Inverse CDF

The linear section of the CDF is easy enough to invert:

y = \frac{x}{\cos(\theta)} \\ x = y \cdot \cos(\theta)

We need to know what y value switches from this linear section to the quadratic section though. For that, we can plug the maximum x value of the linear section into that section of the CDF to get that y value.

f(x) = \frac{x}{\cos(\theta)} \\ f(\frac{\cos(\theta)-\sin(\theta)}{2}) = \frac{\cos(\theta)-\sin(\theta)}{2\cos(\theta)}

So, we have the first section of our inverse CDF. For the second half, I’ll wave my hands here a bit and give the answer. Keep in mind, this is for a “CDF” with y values ranging between -0.5 and 0.5. You can shift the result if you need a more traditional CDF.

\text{ICDF}(y) = \begin{cases} y \cdot \cos(\theta) & \text{if } |y| < \frac{\cos(\theta)-\sin(\theta)}{2\cos(\theta)} \\ \pm(\frac{\cos(\theta) + \sin(\theta)}{2} - \sqrt{2 \sin(\theta) \cos(\theta)(0.5 - |y|)}) & \text{otherwise, taking the sign of } y \end{cases}

The Code

It’s on github, but also pasted here in case that link is dead in the future for some reason.

// Derived By William Donnelly

struct float2
{
	float x, y;
};

// The square is centered at (0,0) and has width and height of 1.
// So, the square is (x,y) in (-0.5,0.5)^2
// The CDF / ICDF are then shifted to be between -0.5 and 0.5, instead of being from 0 to 1.
namespace Square
{
	inline float InverseCDF(float u, float2 d)
	{
		float c = std::max(std::abs(d.x), std::abs(d.y));
		float s = std::min(std::abs(d.x), std::abs(d.y));

		if (2 * c * std::abs(u) < c - s)
		{
			return c * u;
		}
		else
		{
			float t = 0.5f * (c + s) - sqrtf(2.0f * s * c * (0.5f - std::abs(u)));
			return (u < 0) ? -t : t;
		}
	}

	inline float CDF(float x, float2 d)
	{
		float c = std::max(std::abs(d.x), std::abs(d.y));
		float s = std::min(std::abs(d.x), std::abs(d.y));

		if (std::abs(x) > 0.5 * (c + s))
		{
			return (x < 0.0f) ? -0.5f : 0.5f;
		}
		else if (std::abs(x) < 0.5f * (c - s))
		{
			return x / c;
		}
		else
		{
			float fromEnd = (0.5f * (c + s) - std::abs(x));
			float u = 0.5f - 0.5f * fromEnd * fromEnd / (c * s);
			return (x < 0.0f) ? -u : u;
		}
	}

	inline float PDF(float x, float2 d)
	{
		float c = std::max(std::abs(d.x), std::abs(d.y));
		float s = std::min(std::abs(d.x), std::abs(d.y));

		if (std::abs(x) < 0.5f * (c - s))
		{
			return 1 / c;
		}
		else if (std::abs(x) < 0.5f * (c + s))
		{
			float fromEnd = (0.5f * (c + s) - std::abs(x));
			return fromEnd / (c * s);
		}
		else
		{
			return 0.0f;
		}
	}
};

Interpolating Color Image Histograms Using Sliced Optimal Transport

This post goes through a common exercise in applying optimal transport to graphics, using informal language and simple, standalone C++ to implement it. The post uses sliced optimal transport which compared to standard optimal transport solvers is more intuitive, more efficient, and makes for a simpler implementation.

The code that goes with this post is ~300 lines of C++ that uses STB for image reading and writing, but is otherwise standalone, and can be found at https://github.com/Atrix256/SOTImageColors.

Optimal Transport

The diagram below shows a number of doughnuts at 3 bakeries that need to go to 5 cafes. The supply (blue numbers) equals the demand (green numbers), for 13 doughnuts total that need to be transported from bakeries to cafes.

Optimal transport is the answer to the question “How many baked goods should each bakery send to each cafe to minimize the delivery distance (cost)?”

Solving this problem in general is pretty complex and computationally costly. Check out the Sinkhorn Algorithm and Hungarian Algorithm for details.

The problem becomes a lot simpler in 1 dimension though. You sort the bakeries by x, and the cafes by x, then assign deliveries by matching index. The first bakery delivers to the first cafe, the second bakery delivers to the second cafe and so on.

In this diagram, each bakery only has 1 doughnut, and each cafe only wants 1 doughnut. To make a bakery have more than 1 doughnut, you would just duplicate the bakery to make N in the same location, for the N doughnuts produced. To make a cafe want more than 1 doughnut, you would similarly duplicate the cafe to make M in the same location, for the M doughnuts required. The algorithm is the same and is solved by just sorting both lists!
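Here is a sketch of that 1D solve, assuming supply equals demand so both lists are the same length after duplication:

#include <algorithm>
#include <utility>
#include <vector>

// 1D optimal transport: sort both lists by position, then match by index.
// Returns (bakeryIndex, cafeIndex) pairs saying who delivers to whom.
std::vector<std::pair<int, int>> Solve1DOT(const std::vector<float>& bakeries, const std::vector<float>& cafes)
{
	// sort indices instead of positions, so we can report the original indices
	std::vector<int> bakeryOrder(bakeries.size()), cafeOrder(cafes.size());
	for (int i = 0; i < (int)bakeries.size(); ++i) bakeryOrder[i] = i;
	for (int i = 0; i < (int)cafes.size(); ++i) cafeOrder[i] = i;
	std::sort(bakeryOrder.begin(), bakeryOrder.end(), [&](int a, int b) { return bakeries[a] < bakeries[b]; });
	std::sort(cafeOrder.begin(), cafeOrder.end(), [&](int a, int b) { return cafes[a] < cafes[b]; });

	// the Nth bakery from the left delivers to the Nth cafe from the left
	std::vector<std::pair<int, int>> deliveries(bakeries.size());
	for (size_t i = 0; i < bakeries.size(); ++i)
		deliveries[i] = { bakeryOrder[i], cafeOrder[i] };
	return deliveries;
}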

Sliced Optimal Transport

Sliced optimal transport is a magical algorithm that lets you solve higher dimensional optimal transport problems by solving several one dimensional optimal transport problems instead.

More specifically, you generate a uniform random unit vector, project your source and target data onto that vector using the dot product, and sort both projections, which pairs each source point with a target point (the 1D optimal transport solution along that line).

Doing that once isn’t going to be enough to solve for the optimal transport, so you do that several times with different uniform random vectors, in a batch, and keep an average of the full N dimensional movement that each projection suggests for each point. At the end of the batch, you move the source points towards the destination by that averaged amount.

Doing one batch isn’t going to be enough to solve for the optimal transport either, so you do several batches, until it converges to an answer.

The number of vector projections per batch, and the number of batches, are hyperparameters that you need to tune for your particular optimal transport problem. You could also set up heuristics, like ending the program when the total amount of movement was below a certain threshold.

If you sat down to implement this, you might wonder how to generate an N dimensional uniform random unit vector, where N is any number of dimensions. Luckily there is a really nice and simple way: generate an N dimensional vector using a normal distribution for each vector component, then normalize it! std::normal_distribution is your friend in C++.
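Here’s a sketch of that:

#include <cmath>
#include <random>
#include <vector>

// Generate an N dimensional uniform random unit vector by drawing each
// component from a normal distribution, then normalizing the result.
std::vector<float> RandomUnitVector(int dimensions, std::mt19937& rng)
{
	std::normal_distribution<float> dist(0.0f, 1.0f);
	std::vector<float> v(dimensions);

	float lengthSquared = 0.0f;
	for (float& component : v)
	{
		component = dist(rng);
		lengthSquared += component * component;
	}

	float length = std::sqrt(lengthSquared);
	for (float& component : v)
		component /= length;

	return v;
}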

Interpolating Color Image Histograms

The goal of this post is to have one image take the colors of another image, which you may know as “Color Grading” if you work in games or graphics.

How does this relate to optimal transport, and our bakeries and cafes?

Each pixel in our source image is a bakery with one doughnut. The RGB value of the pixel is the location of that bakery in a 3D space, if we treat RGB as XYZ for a 3 dimensional vector.

Each pixel in the image we want to take the colors of is a cafe that wants one doughnut. The RGB value of the pixel is the location of that cafe in a 3D space as well.

Our goal is to move each source image pixel RGB value (each doughnut) to a target color image pixel RGB value (a cafe). In the end, there will only be the colors from the target color image.

We could just do this randomly, and set each source color image pixel to a randomly selected target color image pixel, but the source image would look like random noise.

If we do this using optimal transport instead, it means that we assign target color image pixel colors to source image pixels such that we’ve modified the source image pixels the least. That means the relationship between pixels will largely stay the same and we will end up with something that still looks like our source image, but uses only the colors from the target color image.

Note: the source image and the target color image need to be the same size, so that the number of doughnuts available from the bakeries match the number of doughnuts desired at the cafes.

Algorithmic Details

I didn’t spend very long tuning the batch size and number of iterations, but I found good results with a batch size of 16 and 100 iterations.

The sorting is the bottleneck by far, but I found that each batch could be done in parallel, which helped performance a lot on my machine.

The images I used were 792×516 and I am able to calculate optimal transport for color histograms in about 16 seconds on my machine.

I had the algorithm spit out the average movement per pixel to a csv after every step, with the assumption that smaller steps mean it is closer to convergence. Below is the graph of “big cat”, but the other two were very similar.

Images

Here is the source image I used (florida), and its histogram.

Here are the 3 images I used as targets, along with their histograms: bigcat, dunes, turtle.

Results

Florida remapped to big cat

Florida remapped to dunes

Florida remapped to turtle

If you compare the histograms closely, you’ll see that while they are not a 100% match, they are extremely close. Below shows the difference between the big cat histogram and the florida remapped to big cat histogram.

This is also noticeable in the dunes result, where the histogram shows that the colors are not all perfectly grey. An easy way to compare is to open one of the source image/histograms in the last section in one browser tab, and the corresponding result below in another tab, and flip back and forth.

Interpolation

Solving the optimal transport problem tells you where to move each source image pixel to make the color histogram match a target. Instead of moving the pixels all the way in the RGB 3d cube to the correct location, you can move the pixels part of the way there, thus interpolating the source image histogram to the target image histogram. Below shows an interpolation from florida to big cat, at 0%, 25%, 50%, 75% and then 100%.

When you interpolate from A to B by 25%, you are really doing barycentric interpolation along a line (a 1D simplex) with barycentric coordinates (u,v) = (0.75, 0.25). To get the result, you are multiplying the original image by 0.75, the target image by 0.25, and adding the results together.
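
In code, moving each pixel part of the way toward its optimal transport destination could look like the sketch below, where “transported” holds where the solve said each pixel should end up (the names here are mine):

#include <array>
#include <vector>

using Vec3 = std::array<float, 3>;

// t = 0 leaves the image alone, t = 1 fully remaps it to the target
// histogram, and values in between interpolate the histogram.
void InterpolateToward(std::vector<Vec3>& pixels,
                       const std::vector<Vec3>& transported, float t)
{
    for (size_t i = 0; i < pixels.size(); ++i)
        for (int c = 0; c < 3; ++c)
            pixels[i][c] = pixels[i][c] * (1.0f - t) + transported[i][c] * t;
}

Nothing stops you from passing a t outside of 0 to 1 either, which is the extrapolation idea discussed below.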

Barycentric interpolation generalizes to any dimension. In 2D, you get three barycentric coordinates and are interpolating on a triangle.

Also, while barycentric coordinates are meant for doing interpolation, with all coordinates summing to 1 and all being between 0 and 1, you can break those assumptions. When a coordinate is less than 0 or greater than 1, you are doing barycentric extrapolation. You could do the same here, if for instance you just wanted a photo to be LESS like another photo, as far as the color histogram is concerned.

I’ll leave that to you to experiment with though, if that sounds interesting 🙂

Lastly, the code uses white noise uniform random lines to project onto. White noise is good “at the limit” but for smaller numbers of samples, it tends to clump up and leave holes. I’m betting that using a low discrepancy sequence to generate the lines would give better results that converge faster. I’m leaving that as a teaser for you again, but also think I will give that a try in the future.

Links and Resources

Here’s a post very similar to mine, also in C++: https://dcoeurjo.github.io/OTColorTransfer/

A previous blog post of mine talked about how to use optimal transport to measure the similarity of two 1D histograms. Check it out here: https://blog.demofox.org/2022/07/27/calculating-the-similarity-of-histograms-or-pdfs-using-the-p-wasserstein-distance/

A simple introduction on Sinkhorn distances: https://amsword.medium.com/a-simple-introduction-on-sinkhorn-distances-d01a4ef4f085

Sliced Optimal Transport: https://www.numerical-tours.com/matlab/optimaltransp_4_matching_sliced/

How To Make Your Own Spooky Magic Eye Pictures (Autostereograms)

The simple standalone C++ code that goes with this post and makes autostereograms, can be found at https://github.com/Atrix256/Autostereogram.

The 1990s! They felt like a wasteland of culture at the time, but looking back, there were hypercolor t-shirts, the beginning of mainstream computing and the internet, the height of alternative rock, and of course magic eye pictures.

Unfocus your eyes such that the two dots overlap. A 3D image should emerge!


Quick PSA if you can’t see it!

To make an autostereogram, you need two things:
1. Color Image: A tileable repeating pattern. This can also just be “white noise”.
2. Depth Image: A grey scale depth map, or a black and white mask. This defines the 3D shape. Brighter pixel values are closer in depth.

For the above, I snagged these two from Pinterest.

The image you are making is going to be the width and height of the depth image, but is going to have as many color channels as the color image.

You build the output image row by row, from left to right. To start out, we can just tile the output image with the color image. The Output Image pixel at (x,y) is the Color Image pixel at (x % ColorWidth, y % ColorHeight). That makes a tiled image, which does not have any 3d effect whatsoever:

To get a 3D effect we need to modify our algorithm. We need to read the Depth Image at pixel (x,y) to get a value from 0 to 255. We divide that by 255 to get a fractional value between 0 and 1. We then multiply that value by the “maximum offset amount”, which is a tuneable parameter (I set it to 20), to get an offset amount. This offset is how much we should move ahead in the pattern.

So, instead of Output Image pixel (x,y) using the Color Image pixel (x % ColorWidth, y % ColorHeight), we are calculating an offset from the Depth Image and using the Color Image pixel ((x + offset) % ColorWidth, y % ColorHeight).

Doing that, we aren’t quite there. Some 3D effects are starting to pop out, but it doesn’t look quite right.

In fact, if you use the simpler depth map of the rectangles shown below, you can see the rectangles just fine, but there seem to be holes to the right of them.

What we need to do is not just look into the Color Image at an offset location, but to look at the Output Image we are building, at an offset location. Specifically, we need to look at it in the previous color tile repetition. We use the Output Image pixel at ((x + offset – ColorWidth), y).

A problem with that though, is that when x is less than ColorWidth, we’ll be looking at a pixel x value that is less than 0 aka out of bounds. When x < ColorWidth, we should use the Color Image pixel instead, using the same formula we had before ((x + offset) % ColorWidth, y % ColorHeight).
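
Putting the whole per-pixel rule together, here is a sketch of the generation loop in C++. The Image struct and its accessor are stand-ins I made up for illustration; the actual code is in the repo linked at the top.

#include <cstdint>
#include <vector>

struct Image
{
    int width = 0, height = 0, channels = 0;
    std::vector<uint8_t> pixels; // row major, interleaved channels
    uint8_t* At(int x, int y) { return &pixels[(size_t(y) * width + x) * channels]; }
    const uint8_t* At(int x, int y) const { return &pixels[(size_t(y) * width + x) * channels]; }
};

Image MakeAutostereogram(const Image& color, const Image& depth, int maxOffset)
{
    Image out;
    out.width = depth.width;
    out.height = depth.height;
    out.channels = color.channels;
    out.pixels.resize(size_t(out.width) * out.height * out.channels, 0);

    for (int y = 0; y < out.height; ++y)
    {
        for (int x = 0; x < out.width; ++x)
        {
            // depth in [0,1], scaled by the maximum offset amount
            const float d = float(depth.At(x, y)[0]) / 255.0f;
            const int offset = int(d * float(maxOffset));

            // before the first tile repetition, read the color image;
            // afterwards, read the output image one tile to the left
            const uint8_t* src = (x < color.width)
                ? color.At((x + offset) % color.width, y % color.height)
                : out.At(x + offset - color.width, y);
            for (int c = 0; c < out.channels; ++c)
                out.At(x, y)[c] = src[c];
        }
    }
    return out;
}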

That fixes our problem with the simpler squares depth map. The holes to the right are gone.

And it also mostly fixes our “grave” image:

There is one problem remaining with the grave image though. How these images work is that your left eye needs to be lined up with an unmodified tile on the left, and your right eye needs to be lined up with a modified tile on the right. The grave image has depth information very close to the left side, which makes that impossible. To fix this, you can add an extra “empty color tile” on the left. That makes our image a little bit wider, but it makes it work. This also has the added benefit of centering the depth map, where it previously was shifted to the left a bit.

There we are, we are done!

Other Details

  • I found it useful to normalize the greyscale depth map. Some of them don’t use the full 0 to 1 range, which means they aren’t making the most use of the depth available. Remapping them to 0 to 1 helps that. (See the sketch after this list.)
  • Some masks were meant to be binary black or white, but the images I downloaded from the internet had some grey in them (they were .jpg which is part of the problem – lossy compression). Having an option to binarize these masks was useful, forcing each pixel value to 0 or 1, whichever was closer.
  • The binary masks I downloaded had the part I was interested in being black, with a white background. I made the code able to invert this, so the interesting part would pop out instead of receding in.
  • The distance between the helper dots on the images is the width of the Color Image. A wider color image means a person has to work harder to get those dots to overlap, and it may not even be possible for some people (I’m unsure of details there hehe). I used tile sizes of 128.
  • It’s hard to make out fine detail from the depth maps. It seems like larger, coarse features are the way to go, instead of fine details.
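
Here is roughly what the normalize and binarize operations from the list above could look like (a sketch that assumes the greyscale values live in a simple byte array; not the repo code):

#include <algorithm>
#include <cstdint>
#include <vector>

// Stretch a greyscale depth map so it uses the full 0 to 255 range.
void NormalizeDepth(std::vector<uint8_t>& depth)
{
    if (depth.empty())
        return;
    const auto [lo, hi] = std::minmax_element(depth.begin(), depth.end());
    if (*lo == *hi)
        return; // a flat image: nothing to stretch
    for (uint8_t& v : depth)
        v = uint8_t(255 * (v - *lo) / (*hi - *lo));
}

// Force each pixel to 0 or 255, whichever is closer. Inverting makes the
// region of interest pop out instead of receding in.
void BinarizeMask(std::vector<uint8_t>& mask, bool invert)
{
    for (uint8_t& v : mask)
    {
        v = (v < 128) ? 0 : 255;
        if (invert)
            v = 255 - v;
    }
}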

More Pictures

Here is the grave, with a different color texture. I find this one harder to see.

And another, using RGB white noise (random numbers). I can see this one pretty easily, but it isn’t as fun themed as the pumpkin image 🙂

And here is greyscale white noise (random numbers) used for the color image. I can see this just fine too.

I also tried using blue noise as a Color Image but I can’t see the 3d at all, and it isn’t a mystery what the 3d image is. You can see the repetition of the depth map object from the feedback. I think it’s interesting that the repetition is needed to make it look right. I have no understanding of why that is, but if you do, please leave a comment!

Here are some images that are not the grave. I find the pumpkin color texture works pretty nicely 🙂

Links

This video kicked off this nerd snipe: https://youtu.be/-okxLz1UauA?si=y_QK8-Bv4EzZSGBv

This was also helpful: https://flothesof.github.io/making-stereograms-Python.html

Here is the code I wrote that makes these autostereograms again: https://github.com/Atrix256/Autostereogram

Lastly, I think it would be really neat to make a game that used this technique to render 3d. It could be something simple like a brick breaking game, or as complex as a first person shooter. A challenge with this is that you need to process the image from left to right, due to the feedback loop needed. That won’t be the fastest operation on the GPU, since it forces pixel processing to serialize, unless someone has a clever idea to help with that. Still, it would be pretty neat as a tech demo!

When There Are Too Many Experiments (Tests) To Do! Part 2: Orthogonal Arrays

The simple standalone C++ code to generate orthogonal arrays that goes with this post can be found at https://github.com/Atrix256/OrthogonalArray

The last blog post talked about a way to design a series of experiments with N combinations that let you do only N/2 of the experiments, or N/4, or fewer. It also had a formalized framework for understanding what data you could and could not take from the results. That post is at: https://blog.demofox.org/2023/10/17/fractional-factorial-experiment-design-when-there-are-too-many-experiments-to-do/

This post is also about reducing the number of experiments needed, but uses orthogonal arrays, and is based on work by Genichi Taguchi (https://en.wikipedia.org/wiki/Genichi_Taguchi).

Let’s motivate this with a tangible situation. Imagine we are trying to find the best brownie recipe and we have three binary (yes/no) parameters. The first parameter “A” determines whether or not to add an extra egg. A value of 1 in a test means to add the egg, a value of 0 means don’t add the egg. The second parameter “B” determines whether to double the butter. The third parameter “C” determines whether we put nuts on top.

Here are the 8 combinations of options, which you may recognize as counting to 7 in binary.

         A (extra egg)   B (double butter)   C (add nuts)
Test 0   0               0                   0
Test 1   0               0                   1
Test 2   0               1                   0
Test 3   0               1                   1
Test 4   1               0                   0
Test 5   1               0                   1
Test 6   1               1                   0
Test 7   1               1                   1

Now we want to reduce the number of experiments down to 4, because we only have 4 brownie pans and also don’t want to get too fat trying to figure this out. Here are the 4 experiments we are going to do, along with a “result” column where we had our friends come over and rate the brownies from 1-10, and we took the average for each version of the recipe.

         A   B   C   Result
Test 0   0   0   0   8
Test 1   0   1   1   6
Test 2   1   0   1   9
Test 3   1   1   0   7

At first glance the tests done seem arbitrary, and the results seem to not give us much information, but there is hidden structure here. Let’s mark the tests where option A is false with an asterisk, and leave the others alone.

           A   B   C   Result
Test 0 *   0   0   0   8
Test 1 *   0   1   1   6
Test 2     1   0   1   9
Test 3     1   1   0   7

If you notice, we are doing two tests for A being false, and in those two tests, we have a version where B is true, and a version where B is false. The same is true for C; it’s true in one test and false in the other. This essentially makes the effects of the B and C options cancel out in these two tests, if we average the results.

Looking at the two tests for when A is true, the same thing happens. One test has B being true, the other false. One test has C being true, the other false. Once again this makes the effects of B and C cancel out, if we average the results.

Averaging the results when A is false gives us 7. Averaging the results when A is true gives us 8. This seems to indicate that adding an extra egg gives a small boost to the tastiness of our brownies. In our tests, due to the effects of B and C canceling out, we can be more confident that we really are seeing the results of the primary effect “A” (Adding an extra egg), and not some secondary or tertiary effect.

More amazingly, we can mark the tests where B is false and see that once again, in the two tests where B is false, A and C have both possible values represented, and cancel out again. The same is true for when B is true.

           A   B   C   Result
Test 0 *   0   0   0   8
Test 1     0   1   1   6
Test 2 *   1   0   1   9
Test 3     1   1   0   7

The same is true if we were to highlight where C is false. Every parameter is “isolated” from the others by having this property where the other parameters cancel out in the tests.

Let’s make a table showing the average scores for when A,B and C are false vs true.

                     No    Yes
A (extra egg)        7     8
B (double butter)    8.5   6.5
C (nuts on top)      7.5   7.5

From this table we can see that the butter option makes the biggest difference in score, and that our friends actually preferred the regular amount of butter. Adding another egg helps, but not by as much. Lastly, people don’t seem to care whether there are nuts on top of the brownies or not.
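
The analysis is nothing more than selective averaging. Here is a sketch with the test table and results hard coded (the layout is mine, the numbers are from the tables above):

#include <cstdio>

int main()
{
    // the four tests: A, B, C settings and the averaged taste score
    const int   tests[4][3] = { {0,0,0}, {0,1,1}, {1,0,1}, {1,1,0} };
    const float results[4]  = { 8.0f, 6.0f, 9.0f, 7.0f };

    const char* names[3] = { "A (extra egg)", "B (double butter)", "C (nuts on top)" };
    for (int factor = 0; factor < 3; ++factor)
    {
        // each option value shows up in exactly 2 of the 4 tests
        float avg[2] = { 0.0f, 0.0f };
        for (int test = 0; test < 4; ++test)
            avg[tests[test][factor]] += results[test] / 2.0f;
        printf("%s: no = %0.1f, yes = %0.1f\n", names[factor], avg[0], avg[1]);
    }
    return 0;
}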

It’s pretty neat that we were able to do this analysis and make these conclusions by only testing 4 out of the 8 possibilities, but we also still have missing information – what we called aliases in the last blog post. Software that deals with fractional factorial experiment design (there’s one called Minitab) is able to show the aliasing structure of these “Taguchi Orthogonal Arrays”, but I wasn’t able to find how to calculate what the aliases are exactly. If someone figures that out, leave a comment!

I just spilled the beans, in that the table of tests to do is called an “Orthogonal Array”. More specifically it’s an orthogonal array of strength 2.

For an orthogonal array of strength 2, you can take any 2 columns, and find all possible combinations of the parameter values an equal number of times. Here is the testing table again, along with the values of “AB”, “AC”, and “BC” to show how they each have all possible values (00, 01, 10, 11) exactly once.

         A   B   C   AB   AC   BC
Test 0   0   0   0   00   00   00
Test 1   0   1   1   01   01   11
Test 2   1   0   1   10   11   01
Test 3   1   1   0   11   10   10

Orthogonal arrays give this isolation / cancellation property that allows you to do a reduced number of tests, and still make well founded conclusions from the results.

A higher strength orthogonal array means that a larger number of columns can be taken, and the same will be true. This table isn’t a strength 3 orthogonal array, because it is missing some values. The full factorial test on these three binary parameters would be a strength 3 orthogonal array, which gives better results, but is full factorial, so doesn’t save us any time.

Unlike the last blog post, parameters don’t have to be limited to binary (2 level) values with these arrays. We could turn parameter A (add an extra egg) into something that had three values: 0 = regular amount of eggs, 1 = an extra egg, 2 = two extra eggs.

The full factorial number of tests would then become 12 instead of 8:

          A (3 level)   B (2 level)   C (2 level)
Test 0    0             0             0
Test 1    0             0             1
Test 2    0             1             0
Test 3    0             1             1
Test 4    1             0             0
Test 5    1             0             1
Test 6    1             1             0
Test 7    1             1             1
Test 8    2             0             0
Test 9    2             0             1
Test 10   2             1             0
Test 11   2             1             1

Unfortunately in this case, there is no strength 2 orthogonal array that is smaller than the full 12 tests. The reason is that an orthogonal array needs each pair of columns to contain every possible value combination an equal number of times. The column pairs AB and AC both have 6 possible value combinations, while the column pair BC has 4. If we did only 4 tests, the AB and AC column pairs would each be missing 2 of their 6 possible combinations. If we did 6 tests, the column pair BC would have 2 extra values, which can also be seen as missing the 2 values needed to have every combination appear twice.

To formalize this, the minimum number of tests you can do is the least common multiple of the column pair value combinations. In this example again, AB and AC have 6, while BC has 4. The least common multiple between 6 and 4 is 12. So, you need to do a multiple of 12 experiments with this setup to make an orthogonal array. The full factorial array is size 12, and in fact is the orthogonal array that we are talking about.
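
That minimum is easy to compute in code. Here’s a sketch (std::lcm is from C++17’s <numeric> header):

#include <numeric>
#include <vector>

// levels[i] is the number of values factor i can take. A strength 2
// orthogonal array needs a row count that is a multiple of the number of
// value combinations of every column pair, so the minimum is their LCM.
int MinOrthogonalArrayRows(const std::vector<int>& levels)
{
    int result = 1;
    for (size_t i = 0; i < levels.size(); ++i)
        for (size_t j = i + 1; j < levels.size(); ++j)
            result = std::lcm(result, levels[i] * levels[j]);
    return result;
}

For the 3 level, 2 level, 2 level example above, this gives lcm(6, 6, 4) = 12. As we’ll see later though, this is only a lower bound; sometimes no orthogonal array of exactly this size exists.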

If you add another 2 level factor (binary parameter), the minimum number of experiments stays at 12, but the full factorial becomes 24, so we are able to beat the full factorial in that case.

          A (3 level)   B (2 level)   C (2 level)   D (2 level)
Test 0    0             0             0             1
Test 1    0             0             1             0
Test 2    0             1             0             1
Test 3    0             1             1             0
Test 4    1             0             0             1
Test 5    1             0             1             0
Test 6    1             1             0             0
Test 7    1             1             1             1
Test 8    2             0             0             0
Test 9    2             0             1             1
Test 10   2             1             0             0
Test 11   2             1             1             1

Robustness Against Uncontrolled Variables

I’ve only read a little about Taguchi, but what I have read has really impressed me. One thing that really stood out to me is nearly a social issue. Taguchi believed that producing precise, consistent results was the correct ultimate goal of a factory. He believed erratic and low quality results harmed both the consumer and the company, and demonstrated how harm to either ultimately came back to the company. From wikipedia:

Taguchi has made a very influential contribution to industrial statistics. Key elements of his quality philosophy include the following:

  1. Taguchi loss function, used to measure financial loss to society resulting from poor quality;
  2. The philosophy of off-line quality control, designing products and processes so that they are insensitive (“robust”) to parameters outside the design engineer’s control.
  3. Innovations in the statistical design of experiments, notably the use of an outer array for factors that are uncontrollable in real life, but are systematically varied in the experiment.

I’d like to highlight his work in making processes robust against uncontrolled variables.

What you do is run the tests when the uncontrolled variable is a certain way, then when it’s another way, you run the tests again. You might do this several times.

Then, for each combination of settings you do have control over, you take the standard deviation for when the uncontrolled variable varies. Whichever combination of controllable settings gives you the lowest standard deviation (square root of variance), means that setting of variables is least affected by the uncontrolled variable.
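
As a sketch, if results[setting][run] held the measured result for each combination of controllable settings, across several runs where the uncontrolled variable varied (the names and layout are mine):

#include <cmath>
#include <limits>
#include <vector>

// Return the index of the controllable setting combination whose results
// varied the least across the runs of the uncontrolled variable.
size_t MostRobustSetting(const std::vector<std::vector<float>>& results)
{
    size_t best = 0;
    float bestStdDev = std::numeric_limits<float>::max();
    for (size_t s = 0; s < results.size(); ++s)
    {
        const std::vector<float>& runs = results[s];
        float mean = 0.0f;
        for (float r : runs)
            mean += r / float(runs.size());
        float variance = 0.0f;
        for (float r : runs)
            variance += (r - mean) * (r - mean) / float(runs.size());
        const float stdDev = std::sqrt(variance);
        if (stdDev < bestStdDev)
        {
            bestStdDev = stdDev;
            best = s;
        }
    }
    return best;
}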

It may not be the BEST result, but it lets you have a CONSISTENT result, in uncontrollable conditions. Sometimes consistency is more important than getting the best score possible. FPS in a video game is an example of this. A smooth frame rate feels and appears a lot better perceptually than a faster, but erratic frame rate.

Note that you can use this technique with CONTROLLABLE variables too, to see if there are any variables that don’t have a strong impact on the result. You can use this knowledge to your advantage, for example by removing those variables from future tests.

Here is another example of where that could be useful. Imagine you were making a video game, and you provided the end user with a set of graphics options to help them control the game’s performance on their machine. If you run tests of different performance, for different graphics settings, on different machines, you might find that certain settings don’t help performance much AT ALL for your specific game. If this is the case, you can remove the graphics setting from the UI, and set it to a constant value. Simplifying the settings helps you the developer, by having fewer code paths to maintain, it helps QA by having fewer code paths to test, and it helps the user, by only presenting meaningful graphics options.

An Algorithm For Orthogonal Array Generation

I couldn’t find a singular orthogonal array generation algorithm on the net. From what I read, there are different ways of constructing different types of orthogonal arrays, but no generalized one. However, it’s easy enough to craft a greedy algorithm with backtracking, and I did just that. The source code can be found at https://github.com/Atrix256/OrthogonalArray.

Let’s think of the orthogonal array as a matrix where each row is an experiment, and each column lists the values for a parameter across all of the experiments. The goal we are after is to have each column pair have each possible value show up once.

Assuming that all columns have the same number of possible values for a moment (like, they are all binary and can either have a 0 or 1), we can see that when filling up any single column pair with options, we will also fill up all of the other column pairs. As a concrete example, if we have 3 binary columns A,B,C, we know that by filling AB with all the values 00, 01, 10, 11 by choosing experiments from the “full factorial list”, that there will also be four values in the columns AC and BC. So, we only need to find 4 rows (specific experiments) in that case. We can just fill up the first column pair, and the other column pairs will get filled up too.

We can do this by choosing any experiment that has 00 in AB, then 01, 10, and finally 11.

However, the values that fill up the other column pairs may not end up covering the full set of values, and there may be repeats, even if we disallow repeats of specific experiments. Column AC may end up being 00, 00, 11, 10 for instance, where 00 shows up twice and 01 doesn’t show up at all.

So, when we choose an experiment (row) to use that matches the value we want from the AB column, we also have to make sure it isn’t a duplicate value in any of the other columns.

Doing this, we may end up in a spot where we have values left to fill in, but none of our options are valid. At this point, we take a step back, remove the last row we added, and try a new one. This is the backtracking. We may have to backtrack several steps before we find the right solution. We may also exhaustively try all solutions and find no solution at all!
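
Here is a condensed sketch of that search, written from the description above rather than copied from the repo (so details will differ). It also handles the mixed level generalization discussed next:

#include <algorithm>
#include <vector>

// Greedy backtracking search for a strength 2 orthogonal array.
// levels[i] is how many values factor i can take; rows is the target array
// size (assumed to be a multiple of every levels[i] * levels[j]).
struct OASearch
{
    std::vector<int> levels;
    int rows = 0;
    std::vector<std::vector<int>> candidates; // the full factorial rows
    std::vector<std::vector<int>> out;        // the rows chosen so far
    std::vector<std::vector<int>> pairCount;  // per column pair, how many
                                              // times each combo was used

    void MakeCandidates()
    {
        std::vector<int> row(levels.size(), 0);
        while (true)
        {
            candidates.push_back(row);
            size_t i = 0;
            while (i < levels.size() && ++row[i] == levels[i])
                row[i++] = 0;
            if (i == levels.size())
                return;
        }
    }

    // can this row be added without any column pair value combo exceeding
    // its quota of rows / (levels[i] * levels[j]) appearances?
    bool CanAdd(const std::vector<int>& row) const
    {
        size_t pairIndex = 0;
        for (size_t i = 0; i < levels.size(); ++i)
            for (size_t j = i + 1; j < levels.size(); ++j, ++pairIndex)
                if (pairCount[pairIndex][row[i] * levels[j] + row[j]] ==
                    rows / (levels[i] * levels[j]))
                    return false;
        return true;
    }

    void Count(const std::vector<int>& row, int delta)
    {
        size_t pairIndex = 0;
        for (size_t i = 0; i < levels.size(); ++i)
            for (size_t j = i + 1; j < levels.size(); ++j, ++pairIndex)
                pairCount[pairIndex][row[i] * levels[j] + row[j]] += delta;
    }

    bool Search()
    {
        if ((int)out.size() == rows)
            return true; // found a full orthogonal array
        for (const std::vector<int>& row : candidates)
        {
            // disallow repeating a specific experiment
            if (std::find(out.begin(), out.end(), row) != out.end())
                continue;
            if (!CanAdd(row))
                continue;
            Count(row, +1);
            out.push_back(row);
            if (Search())
                return true;
            out.pop_back(); // dead end: backtrack and try the next row
            Count(row, -1);
        }
        return false;
    }

    bool Run()
    {
        MakeCandidates();
        for (size_t i = 0; i < levels.size(); ++i)
            for (size_t j = i + 1; j < levels.size(); ++j)
                pairCount.push_back(std::vector<int>(levels[i] * levels[j], 0));
        return Search();
    }
};

Running it with levels = {2,2,2} and rows = 4 should find the same four tests as the brownie example earlier, just possibly in a different order.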

Generalizing this further, our columns may not all have the same number of possible values – like in our examples where we mixed 3 level and 2 level variables. As I mentioned before, some column pairs will have 6 possible values, while others will have 4 possible values. The only way you can make an orthogonal array in that situation is by making a number of rows that are a multiple of the least common multiple of 6 and 4, which is 12. The 6 value column pairs will have all values twice, and the 4 value column pairs will have each value three times.

Lastly, there are situations where no solution can be found by repeating the values this minimal number of times, but if you do twice as many (or more), a solution can be found, and still be smaller than the full factorial set of experiments. An example of this is if you have five 3 level factors. You can’t make an orthogonal array that is only 9 rows long, but you can make one that is 18 rows long. That is still a lot smaller than the full factorial number of experiments, which is 3^5 or 243 experiments.

Closing and Links

From the number of reads of the previous blog post, this seems to be a topic that really resonates with people. Who doesn’t want to do less work and still get great results? (Tangent: That is also the main value of blue noise in real time rendering!)

In these two posts, we’ve only scratched the surface. Apparently there is still research coming out about this family of techniques – doing a smaller number of tests and gleaning the most relevant data you can from them.

I think a faster algorithm than I described could be created using Algorithm X / Dancing Links (https://blog.demofox.org/2022/10/30/rapidly-solving-sudoku-n-queens-pentomino-placement-and-more-with-knuths-algorithm-x-and-dancing-links/).

The wikipedia article links orthogonal arrays to latin squares, mutually orthogonal latin squares, sudoku and more (https://en.wikipedia.org/wiki/Orthogonal_array#Latin_squares,_Latin_cubes_and_Latin_hypercubes). Interestingly, that also links it to “Interleaved Gradient Noise” in a way, which is an imprecise solution to a generalized sudoku that has too many constraints to exactly solve (https://blog.demofox.org/2022/01/01/interleaved-gradient-noise-a-different-kind-of-low-discrepancy-sequence/).

When starting on this journey, I thought it must surely have applications to sampling in monte carlo integration, and indeed it does! There is a paper from 2019 I saw, but didn’t understand at the time, which uses orthogonal arrays for sampling (https://cs.dartmouth.edu/~wjarosz/publications/jarosz19orthogonal.html).

The rabbit hole has become quite deep, so I think I’m going to stop here for now. If you have more to contribute here, please leave a comment! Thanks for reading!

Fractional Factorial Experiment Design: When There Are Too Many Experiments To Do

Have you ever found yourself in a situation where you had a bunch of parameters to tune for a better result, but there were just too many to exhaustively search all possibilities?

I recently saw a video that talked about how to reduce the number of experiments in these situations, such that it formalizes what conclusions you can and cannot make from those experiments. Even better, the number of experiments can be a sliding scale to either be fewer experiments (faster), or more information from the results (higher quality).

We are going to explore a method for doing that in this post, but you should also give the video a watch before or after reading, as it talks about a different method than we will! https://www.youtube.com/watch?v=5oULEuOoRd0

Fractional Factorial Design Terminology

The term “fractional factorial” is in contrast to the term “full factorial”.

To have some solid context, let’s pretend that we want to figure out how to make the best brownies, and we have 3 options A, B, C. Option “A” may be “add an extra egg”, “B” may be “double the butter” and “C” may be “put nuts on top”, where 1 means do it, and -1 means don’t do it.

Three parameters with two choices each means we have 8 possibilities total:

         A    B    C
Test 0   -1   -1   -1
Test 1   -1   -1    1
Test 2   -1    1   -1
Test 3   -1    1    1
Test 4    1   -1   -1
Test 5    1   -1    1
Test 6    1    1   -1
Test 7    1    1    1

We’ve essentially counted in binary, doing all 8 tests for all 8 possibilities of the three binary parameters.

If you’ve noticed that the number of tests is 2^3, and not a factorial, you are correct. The term “factorial” refers to “factors”, which is another name for a parameter. We have 3 factors in our experiment to make the best brownies.

Each factor has two possible values. In the literature, the number of values a parameter can take is called a level. So, our brownie experiment has 3 two level factors.

If we didn’t have enough pans to bake all 8 types of brownies, we may instead opt to do a smaller amount. We may only do 1/2 of the tests for instance, which means that we would be doing a fractional amount of the full factorial experiment. We’d be doing a fractional factorial experiment. We could even do 1/4 of the experiments. The fewer experiments we do, the less information we get.

Choosing which four experiments to do, and knowing what conclusions we can draw from the results, is where the magic happens.

Fractional Factorial Design

Going back to our brownie recipe, let’s say that we want to do 4 bakes, instead of 8.

We start by making a full experiment table for A and B, but leave C blank.

         A    B    C
Test 0   -1   -1
Test 1   -1    1
Test 2    1   -1
Test 3    1    1

We now have to decide on a formula for how to set C’s value, based on the values of A and B. This is called a generator, and to test things out, let’s set C to the same value as A, so our generator for C is “C = A”.

         A    B    C=A
Test 0   -1   -1   -1
Test 1   -1    1   -1
Test 2    1   -1    1
Test 3    1    1    1

We have 4 experiments to do now, instead of 8, but it isn’t obvious what information we can get from the results. Luckily some algebra can help us sort that out. This algebra is a bit weird in that any letter squared becomes “I”, or identity. Our ultimate goal is to find the “Aliasing Structure” of our experiments, which is made up of “words”. Words are combinations of the parameter letters, as well as identity “I”. The aliasing structure lets you use algebra to find what testing parameters alias with other testing parameters. An alias means you can’t tell one from the other, with the limited number of tests you’ve done.

That is quite a lot of explaining, so let’s actually do these things.

First up we have C = A. If we multiply both sides by C we get CC = AC. Any value multiplied by itself is identity, so we get I = AC as a result. Believe it or not, that is our full aliasing structure and we are done! We’ll have more complex aliasing structures later but we can keep it simple for now.

Now that we have our aliasing structure, if we want to know what something is aliased by, we just multiply both sides by our query to get the result. Here’s a table to show what I mean.

Alias Query   Result
I             I = AC
A             A = AAC = C
B             B = ABC
C             C = ACC = A
AB            AB = AABC = BC
AC            AC = AACC = I
BC            BC = ABCC = AB
ABC           ABC = AABCC = B

Looking at the table above, we see that B = ABC. This means that B aliases ABC. More plainly, this means that from our experiments, if the brownies were rated from 1 to 10 for tastiness, we wouldn’t be able to tell the difference in results between only doing option B (doubling the butter) and doing all three options: adding an egg, doubling the butter, and adding nuts on top.

Another thing we can see in the table is that A = C. That means that we cannot tell the difference between adding an egg, or adding nuts on top.

Lastly, I = AC means that we can’t tell the difference between doing nothing at all, compared to adding an egg and adding nuts on top.

If we care to be able to distinguish any of these things, we’d have to change our experiment design. Specifically, we would need to change the generator for C.

We can do that, and in fact there is a better choice for a generator for C. Let’s make it so that value of C is the value of A and B multiplied together. C = AB.

         A    B    C=AB
Test 0   -1   -1    1
Test 1   -1    1   -1
Test 2    1   -1   -1
Test 3    1    1    1

It doesn’t look like much has changed, but let’s do the same analysis as before.

We can get our aliasing structure by starting with C=AB and multiplying both sides by C to get CC=ABC, which simplifies to I = ABC. Let’s find all of our aliases again.

Alias Query   Result
I             I = ABC
A             A = BC
B             B = AC
C             C = AB
AB            AB = C
AC            AC = B
BC            BC = A
ABC           ABC = I

Something interesting has happened with this design of experiments (D.O.E.). All of the single letters alias with double letters. This means that we can distinguish all primary effects from each other, but that primary effects are aliased (or get “confounded”) with secondary effects. In many situations, the primary effects are more prominent than secondary effects (adding butter or adding nuts is more impactful individually than when combined). In these situations, you can assume that if there’s a big difference noticed, it is the primary effect causing it. If there’s any doubt, you can also do more focused experiment(s) to make sure your assumptions are correct.

Resolution

So why is it that our first generator C=A wasn’t able to differentiate primary effects, while our second generator C=AB could? For such a simple example, it might be obvious when looking at C=A, but this has to do with something called “Resolution”, which is equal to the length of the shortest word in the aliasing structure (not counting I).

With I=AC, that is two letters, so it is a resolution II D.O.E. Resolution II D.O.E.s have at least some primary effects aliased (or confounded) with other primary effects. Resolution II D.O.E.s are usually not very useful.

With I=ABC, that is three letters, so is a resolution III D.O.E. Resolution III D.O.E.s have primary effects aliased with secondary effects. These are more useful, as we explained above.

As you add more factors, you can get even higher resolution D.O.E.s. A resolution IV has primary effects aliased with tertiary effects, and secondary effects are aliased with each other. A resolution V has primary effects aliased with quaternary effects, and secondary effects are aliased with tertiary effects. If you are noticing the pattern that the aliasing effect classes add up to the resolution, you are correct, and that pattern holds for all resolutions.

Getting a higher resolution D.O.E. is usually better, so you want your generators to contain more letters. You have to be careful to make them unique enough though. If you had four factors A,B,C,D and generators C=AB, D=AB, you can see that you’ve introduced an alias C=D.

If you work out the aliasing structure, it becomes apparent that you’ve made a resolution II D.O.E. Aliasing structures always need 2^p words, where p is the number of parameters that have generators. In this case, p is 2, so you need 2^2 or 4 words in the aliasing structure.

The first word comes from C=AB, which can be rewritten as I=ABC. Actually, that’s the first two words.

The next word comes from D=AB, which can be rewritten as I=ABD. That gives us three words:

I = ABC = ABD

To get the fourth word, we can see that both ABC and ABD are equal to each other. Remembering that when we multiply the same thing against itself we get identity, we can make our fourth word by multiplying them together to get something equal to identity for our aliasing structure.

ABC*ABD = AABBCD = CD.

So, our four words all together are:

I = ABC = ABD = CD

Since “CD” is the shortest word that is not identity, and is 2 letters, we can conclude that we have a resolution II D.O.E.

We can multiply the aliasing structure by C to get all the things that C is aliased with.

C = AB = ABCD = D

This shows we cannot tell the difference between C and D, because they are aliased, or confounded, and confirms that we have a resolution II D.O.E.

What’s With The Weird Algebra?

When we said C=AB, where A and B were either -1 or +1 in each column, it was pretty easy to just multiply the values together to get the result.

What probably made less sense is when working with the generators and aliasing structures, why a letter times itself would become identity. The reason for this is that the multiplication is a component wise vector product. If the original value was -1, then we get -1*-1 or +1. If the original value was +1, then we get 1*1 or +1. The result of a component wise vector product between a vector and itself will have a 1 in every component, which is identity.

Nothing too mysterious going on.
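
This algebra is also very mechanical to do in code. If you represent a word as a bitmask of which letters are present, multiplying two words is just XOR, since a letter appearing twice cancels to identity. A small sketch:

#include <cstdio>
#include <string>

// "ABD" becomes a mask with bits 0 (A), 1 (B) and 3 (D) set. XORing while
// building means a letter appearing twice cancels out.
unsigned int WordToMask(const std::string& word)
{
    unsigned int mask = 0;
    for (char c : word)
        mask ^= 1u << (c - 'A'); // assumes upper case letters
    return mask;
}

std::string MaskToWord(unsigned int mask)
{
    std::string ret;
    for (int bit = 0; bit < 26; ++bit)
        if (mask & (1u << bit))
            ret += char('A' + bit);
    return ret.empty() ? "I" : ret;
}

int main()
{
    // multiplying two words: ABC * ABD = CD
    printf("%s\n", MaskToWord(WordToMask("ABC") ^ WordToMask("ABD")).c_str());

    // an alias query: with I = ABC, what does B alias? prints AC
    printf("%s\n", MaskToWord(WordToMask("B") ^ WordToMask("ABC")).c_str());
    return 0;
}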

Larger Example: 8 Experiments, 5 Factors, 2 Levels

To help cement these ideas, we’ll do one more example with 8 experiments and 5 factors. That means we’ll only do 1/4 of the experiments we would if doing the full factorial, doing 8 experiments instead of 32.

8 experiments mean that the first three factors will be full factorial, and the last two parameters will need generators. We’ll use D = AB and E=BC as our generators.

We can make our experiment table to start.

         A    B    C    D=AB   E=BC
Test 0   -1   -1   -1    1      1
Test 1   -1   -1    1    1     -1
Test 2   -1    1   -1   -1     -1
Test 3   -1    1    1   -1      1
Test 4    1   -1   -1   -1      1
Test 5    1   -1    1   -1     -1
Test 6    1    1   -1    1     -1
Test 7    1    1    1    1      1

Let’s also make our aliasing structure. Since we are using generators on 2 parameters, that means p=2, and so we have 2^p = 2^2 = 4 words in our aliasing structure. The first 3 are easy, since they come from our generators directly.

D = AB : I = ABD
E = BC : I = BCE
so:
I = ABD = BCE

For the fourth word, we again know that ABD and BCE are equal since they are both equal to identity. We also know that multiplying anything by itself is identity, so we can multiply them together to get the fourth word:

ABD*BCE = ABBCDE = ACDE

Our full aliasing structure is then the below, which shows that we have a resolution III D.O.E., meaning primary effects are only aliased with secondary effects.

I = ABD = BCE = ACDE

Lastly, let’s make the full 32 entry table showing all aliases to understand the things we can, and cannot differentiate from our experiment results.

Alias Query   Result
I             I = ABD = BCE = ACDE
A             A = BD = ABCE = CDE
B             B = AD = CE = ABCDE
C             C = ABCD = BE = ADE
D             D = AB = BCDE = ACE
E             E = ABDE = BC = ACD
AB            AB = D = ACE = BCDE
AC            AC = BCD = ABE = DE
AD            AD = B = ABCDE = CE
AE            AE = BDE = ABC = CD
BC            BC = ACD = E = ABDE
BD            BD = A = CDE = ABCE
BE            BE = ADE = C = ABCD
CD            CD = ABC = BDE = AE
CE            CE = ABCDE = B = AD
DE            DE = ABE = BCD = AC
ABC           ABC = CD = AE = BDE
ABD           ABD = I = ACDE = BCE
ABE           ABE = DE = AC = BCD
ACD           ACD = BC = ABDE = E
ACE           ACE = BCDE = AB = D
ADE           ADE = BE = ABCD = C
BCD           BCD = AC = DE = ABE
BCE           BCE = ACDE = I = ABD
BDE           BDE = AE = CD = ABC
CDE           CDE = ABCE = BD = A
ABCD          ABCD = C = ADE = BE
ABCE          ABCE = CDE = A = BDE
ABDE          ABDE = E = ACD = BC
ACDE          ACDE = BCE = ABD = I
BCDE          BCDE = ACE = D = AB
ABCDE         ABCDE = CE = AD = B

Closing

A limitation you may have noticed in this article is that it only works with binary parameters (2 level factors). Part 2 of this blog post will show how to overcome that limitation using a different technique for fractional factorial experimental design.

Until then, there is a way to extend this method to factors which have a multiple of 4 levels. If you are interested in that, google Plackett-Burman.

Alan Wolfe (Atrix256/Demofox) Banned (?) From X (Twitter)

“…private tyrannies <are> the worst kind of tyrannies” — Noam Chomsky (https://www.youtube.com/watch?v=iR1jzExZ9T0).

“cruel, unreasonable, or arbitrary use of power or control.” — definition of tyranny as given by google.com, emphasis is mine.

Hey All!

It appears I’ve been banned from X (twitter). I’m writing this post to show some details of this process, as well as let people know what happened instead of me just being “disappeared” quietly.

First the process of being banned.

I log in on Friday and it says I’m suspended. It tells you that you can appeal it by “proving you haven’t broken any rules” but it doesn’t tell you what you are accused of.

After maybe a day and a half I get to thinking “you know, I don’t think I should be suspended for what I said”, so I appeal, and within 30 seconds get an email that says the appeal is denied. It only then tells me what I’m accused of.

It still doesn’t tell me the punishment: is this temporary? Is it permanent? I’ve heard of people having 3 day suspensions, but 3 days have come and gone so I’m unsure if the suspension will ever be lifted. It might be lifted tomorrow, in 6 months, or never. Who knows!

Putting this in legal terms, imagine…

  1. You get arrested. You are not told what you are arrested for, nor how long you will be imprisoned.
  2. You are told you can write a letter to a judge explaining how you didn’t break the law.
  3. You write this letter, and when you hand it to the guard, he already has a response in his hand that he hands back to you.
  4. End of conversation, you sit in jail, perhaps never to see daylight again.

Getting banned from a social media platform is not the same as being jailed, but it is a good analogy to show the level of “justice” that the platform is interested in.

When made physical, it’s akin to places like Iran or China dealing with political dissidents or other undesirables. This is why we need to be careful about conflating corporate culture with American values.

It’s also worth mentioning that, through a normal course of actions, I was banned from PayPal many years ago because it assumed I was a fraudulent scammer for some reason, and that ban was unappealable. Elon was involved there as well. I was just doing regular things and it came out of nowhere.

This isn’t the first time twitter has gotten mad at me for things I’ve said. I wished ill will at several GOP politicians, including Trump, and twitter used to have the policy of “you are banned until you delete this specific post”. But, this is the first time since Elon took over.

A short time before this happened, John Carmack was asking why folks were trying to leave X, saying he couldn’t understand it; could it really be spite? I wanted to post my response to his question but it’s been deleted! Most of my tweets are still there and I can see them, but this specific one criticizing Elon Musk is gone. Could this be part of the motivation for where I’m at now? Was I flagged as “kick out and don’t let back in, at the slightest hint of it being ok to do so”?

https://x.com/ID_AA_Carmack/status/1699060703272010047?s=20

My response to John was that Elon was making poor decisions, destabilizing the platform, and I didn’t trust him to have good intentions.

Ok so why the suspension?

Have a look at what kicked it off. Trigger warning: police violence against women.

https://x.com/Atrix256/status/1702400552032936032?s=20 The X link no longer works, so here is the YouTube video: https://www.youtube.com/watch?v=RU4VGQruhQc . The “lucky that wasn’t fatal” attack happens 45 seconds in.

A local police department (this incident happened 30 minutes from my house, and my first game dev job is literally a street away) is shown having an altercation with a WEDDING party, in which the police start throwing punches at women, in nice dresses and high heels. One of them hits the back of her head against the ground and is knocked out.

Hitting the back of your head like that is no joke. That’s where your brain is connected to the rest of your body. My old martial arts teacher, an ex marine, called it the “dumb knot” and let us know how dangerous it was so we could train safely, and also know the implications for self defense type situations.

Seeing that pissed me off. Yes, there was some “drunken idiocy” going on, including reports that a woman tried to grab one of the officer’s guns (if they can be trusted). Assuming so, you might ask me “ok so what SHOULD they have done if not that?”. Honestly, not sure, but NOT THAT. The woman who got knocked out was essentially choke slammed to the ground to make her hit her head in a life threatening way. That was intentional – it was a guided fall straight onto the back of her head where her spine connects to her skull – and should be considered attempted murder, frankly.

What I believe specifically got me suspended though is someone made a comment that nobody got shot. My response was “yeah, those cops better watch out next time for a good guy with a gun”. Playing off the arguments about how the solution to gun violence is more armed people, and also that we are pretty powerless when police go beast mode like this, just acting completely out of line. How many videos have we seen of police violently beating people in handcuffs who never resisted? Especially protestors. Or reporters and MEDICS at protests. This bullshit has to stop.

What happened to you being the champion of free speech Elon? The truth is, you only want the speech you agree with to be free, and that makes you a dangerous idiot.

People considering going to mars with this guy, go watch Total Recall with Arnold Schwarzenegger to get an idea of what that would be like.

Anyhow, catch me on mastodon and blue sky assuming this suspension is permanent (It’s the halting problem, so no idea). It’s you, my connections and our awesome community that I value most, so if it ends up not being permanent, see you on x til it sinks or is under significantly different management – hopefully soon!

Also, someone make sure John Carmack sees this, as a better answer to his post I linked above 🙂

Generating Hamiltonian Cycles on Graphs with Prime Numbers of Nodes, Without Repeating Edges

The last post showed how random Hamiltonian cycles on a graph (a path that visits each node exactly once) could efficiently handle ranking of a large number of items, by doing sparse pairwise voting (https://blog.demofox.org/2023/09/01/sparse-pairwise-voting-or-tournaments-implementing-some3-voting/).

That algorithm needed to generate multiple Hamiltonian cycles which didn’t use any graph edges already used by other cycles, and I didn’t like that it did this by generating cycles randomly until it found one that is valid. The “randomly generate candidates until success” meant an erratic runtime that might not even find a solution in a reasonable amount of time.

This post shows a way to generate those cycles in a straightforward way and compares the resulting cycles to the algorithm in the previous post. The algorithm requires a prime number count of nodes, but if you don’t have a prime number count of nodes, you can pad the graph with extra nodes to make it reach the next prime number. If the cycles are for voting, you can make those padding nodes automatically lose to real nodes, without needing to do an actual vote. This algorithm also works with non prime number amounts of nodes if you relax what “works” means a bit; instead of making cycles, it will sometimes make isolated loops, but we talk about how that can still be useful.

The code that goes with this post lives at https://github.com/Atrix256/PairwiseVoting. The visual “star” diagrams were made with drawgraphs.py. The csv files for the graphs were made by main.cpp.

The Algorithm

The algorithm is pretty straightforward.

If you have N nodes and N is a prime number, you loop from 0 to N-1. nodeA is the loop index. nodeB is (nodeA + 1) % N. This is the first cycle.

For the second cycle, you loop again. nodeA is the loop index. nodeB is (nodeA + 2) % N.

For the third cycle, you loop again. nodeA is the loop index. nodeB is (nodeA + 3) % N.

To get all cycles, you repeat this, for each integer from 1 to N/2, but not including N/2.
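
The whole thing fits in a few lines of C++. A sketch:

#include <cstdio>

int main()
{
    const int N = 5; // the number of nodes, assumed to be prime

    // each step size from 1 up to (but not including) N/2 gives one
    // Hamiltonian cycle, and no edge appears in more than one cycle
    for (int step = 1; step * 2 < N; ++step)
    {
        printf("Cycle %d:\n", step);
        for (int nodeA = 0; nodeA < N; ++nodeA)
            printf("  %d - %d\n", nodeA, (nodeA + step) % N);
    }
    return 0;
}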

Let’s see an example.

Example 1 – Five Nodes

For the first cycle, we use each node index 0,1,2,3,4 as nodeA, and add 1 to each, modulo 5, to get nodeB.

Node A   Node B   Connection
0        1        0 – 1
1        2        1 – 2
2        3        2 – 3
3        4        3 – 4
4        0        4 – 0

We next add 2:

Node A   Node B   Connection
0        2        0 – 2
1        3        1 – 3
2        4        2 – 4
3        0        3 – 0
4        1        4 – 1

N/2 is 2.5, so we should stop at this point. Let’s see what happens if we do 3 anyways.

Node A   Node B   Connection
0        3        0 – 3
1        4        1 – 4
2        0        2 – 0
3        1        3 – 1
4        2        4 – 2

We got the same connections when we added 3, as when adding 2. We’ll see why shortly.

Example 2 – Thirteen Nodes

Just the diagrams, one cycle at a time, because otherwise it’s too messy.

Why Does This Work?

For the first cycle, we start with every node we have as nodeA, then add 1 and modulo by N to get each nodeB. That gives us N connections, and they are all unique. We have used up all the connections where the nodes are 1 index apart and no other cycle is going to make a connection with nodes that are 1 index apart. We know that this is a cycle and that we can get from any node to any other node because you can subtract 1 or add 1 repeatedly to get from any index to any other index.

For the second cycle, we do the same but add 2 and modulo by N to get nodeB. We again have N connections and they are all unique. We also have used up all the connections where the nodes are 2 indices apart and no other cycle is going to make any connection where the nodes are 2 indices apart.

For the second cycle, we can also get from any node to any other node in this cycle, but it’s less obvious why. When working with the integers modulo N, you have something called a ring, which under addition forms a group. If you start at any number on the ring, and repeatedly add some number M which is relatively prime to N (no common factors other than 1), you will visit all numbers 0 to N-1, and will visit them all once before repeating any. When N is a prime number, EVERY number less than N is relatively prime to N.

You can try this yourself:

  1. Pick some prime number N
  2. Pick any number M between 1 and N-1.
  3. Starting at zero (or any other value), add M, modulo by N, and write out the value.
  4. Repeat step 3 and you’ll see that you’ll visit all N numbers and after N steps you will get back to where you started.

Let’s do this with N=7 and M=2:

StepValueNext Value
10(0 + 2) % 7 = 2
22(2 + 2) % 7 = 4
34(4 + 2) % 7 = 6
46(6 + 2) % 7 = 1
51(1 + 2) % 7 = 3
63(3 + 2) % 7 = 5
75(5 + 2) % 7 = 0
80(0 + 2) % 7 = 2

So, when N is prime, every time we add a specific value greater than 0 and less than N to all the node indices, we’ll get a cycle that uses edges that no other cycle will give us.

Why Start Cycles at 1 and Stop at N/2?

There are two explanations for why this is. One is a mathematical explanation and the other is a brick wall. Let’s look at the brick wall first, then talk about the mathematical reason.

If you have N nodes in a graph, the total number of connections you can make between nodes is N * (N-1) / 2. Tangent: if you think that looks like Gauss’ formula, you are correct! It’s Gauss’ formula for N-1.

Returning to example 1, that had 5 nodes which means there are 5*4/2 = 10 total connections.

Each Hamiltonian cycle on that graph takes 5 connections away from all possible connections, and with 10 connections total, that means there can only be 2 cycles.

5/2 is 2.5, so we do cycle 1, then cycle 2, and stop before we do cycle 3.

We literally cannot do more than 2 cycles when there are 5 nodes.

Looking at example 2, there are 13 nodes, which means there are 78 total connections.

Each cycle takes 13 connections, so all connections are used up after 6 cycles.

13/2 is 6.5, so we know to stop at 6 before going to 7.

The brick wall limitation is that after N/2 cycles, we have no more edges to make cycles out of!

The mathematical explanation involves rings. To quote the previous section: when working with the integers modulo N, you have something called a ring, which under addition forms a group.

In rings, there is an interesting connection between the numbers less than N/2 and the numbers greater than N/2. Adding a number greater than N/2 is the same as subtracting a number less than N/2. As an example, let’s look at N=7 in the table below. We can see that adding 5 is the same as subtracting 2.

Starting Value   (Starting Value + 5) % 7   (Starting Value – 2) % 7
0                5                          5
1                6                          6
2                0                          0
3                1                          1
4                2                          2
5                3                          3
6                4                          4

This means that if we have a connection where nodeB = (nodeA + 5) % 7, it is exactly the same as the connection where nodeA = (nodeB + 2) % 7. More plainly, cycle 5 will make the same connections as cycle 2. The nodeA and nodeB values will be swapped, but it will be the same connections.

Cycle N/2 is the point where positive numbers become the same as smaller negative numbers, so that is where we stop. Since all prime numbers (greater than 2) are odd, N/2 will never be an integer, so we don’t have to think about what to do at exactly N/2.

What if N isn’t Prime?

If N isn’t prime, not all “cycles” will be cycles. The algorithm will visit all nodes, but there will not be a path from each node to every other node. Here are the cycles for N=9.

With N=9, only cycles 1, 2 and 4 make true cycles. Cycle 3 makes three loops: 0,3,6 and 1,4,7 and 2,5,8. If you wanted to use this algorithm to make connections in a graph for voting or for a tournament, you could still do so, but the first cycle you do should use a step that is coprime to N (cycle 1 is an easy choice), so that the graph is connected from the beginning. These loops would then increase the connectivity of the graph, without having to be fully fledged Hamiltonian cycles.

Allowing N not to be prime also allows an even numbered N. In that case, when you create cycles from 1 to N/2, should you include N/2 or not? Let’s see what happens when N=6, for cycles 1,2 and 3.

What happened to cycle 3?

Cycle 3 made all the connections where the difference between the node indices was 3, and it made 6 connections. Unfortunately, there are only 3 unique connections where the difference in node indices is 3: 0-3, 1-4, 2-5. Cycle 3 added the following redundant connections: 3-0, 4-1, 5-2. The answer about whether you should include N/2 if using a non prime number of nodes is ultimately up to you, but you should know that when you do so, you’ll get redundant connections. On the flip side, if you don’t include them, those are node connections that you will never be able to make otherwise, using this algorithm. You’ll have to decide which way you want to go. You could always pad with another node to make it odd, and dodge this issue 🙂

Is This Random Enough?

When you see that the first cycle gives you the connections in order 0 – 1, 1 – 2, 2 – 3, etc. you may wonder if this method is making random enough cycles.

In defense of this seeming lack of randomness, keep in mind that node indices are just arbitrary names of nodes. Two nodes that have an index difference of 1 are not any more, or less, similar than two other nodes that have an index difference of 5. The numbers are just arbitrary labels.

However, the lack of randomness does come up on subsequent cycles. Think of the path from node A to node B on graphs made this way as a simplified path finding problem. After you’ve done 1 cycle, you know you can take steps of size 1 from any node to any other node. After you’ve done 2 cycles, you know you can take steps of size 2. After 3 cycles you can take steps of size 3 and so on. Using this knowledge, you can craft a path from A to B using the largest steps to get as close as possible, then a smaller step as a remainder. This is nice and orderly, but when graph edges are taken at random, there will be shortcuts sprinkled throughout the graph. For example, in the first cycle of a 1000 node graph, if there is a link from node 0 to node 500, that will drastically reduce the length of the path between the most distant nodes.

So on one hand, the lack of randomness doesn’t matter, but on another hand, it does.

There are some ways to inject randomness into this algorithm if you want to though. One easy thing you can do is shuffle the nodes before running this algorithm. Another thing you can do is perform the cycles in a random order. If you have 1009 nodes, that gives you 1009/2=504 cycles. The last post showed that you only need a handful of cycles to get a good result if using cycles for voting or tournaments. Instead of doing cycles 1, 2, and 3, you could randomize the order and do it an order like 100, 4, 235. These points of randomization could help make up for lack of randomization in the cycles themselves, and are also helpful if you ever need multiple sets of non overlapping Hamiltonian cycles. That allows this algorithm to take in a “random seed” to get different results for different seeds, while remaining deterministic.

Results In Graph Quality Tests

To understand the quality of these Hamiltonian cycles, I plugged them into the voting algorithm from the last blog post. I put in a version that does cycles in order (1, 2, 3 and so on), and also put in a version that does the cycles in a random order, but makes sure the first cycle is always 1, to ensure the graph is connected, in case the number of nodes is not prime.

The graphs below are the average of 100 tests. The source code generates csv files with more information, including standard deviation.

For radius, random cycles are generated differently and shuffled rings are shuffled differently each run. The non shuffled rings are the same each test. For rank distance, the actual score value of each node changes each test, so all rank distance graphs are subject to randomization in each test. This can also be seen as shuffling the nodes before running the algorithm, which is something I talked about in the last section.

For both radius and rank distance, a lower score is better.

It looks like random cycles (the technique from the last post) decrease the radius the most. That is not surprising, since that technique is based on solid mathematics for reducing the radius quickly. Using the deterministic cycles in this post, but shuffling the order of the cycles, looks to be a pretty good method for reducing radius too. It isn’t as good as random cycles, but it can quickly and easily find as many non overlapping Hamiltonian cycles as you want, up to the full limit of how many cycles are possible. Randomly generating cycles until they are valid gets very slow for high cycle counts and high node counts.

These graphs tell a different story for the rank distance. They show that the NON SHUFFLED ordering of deterministic cycles does the best for making the most accurate ranked list. I’m a bit surprised, as this shows that decreasing radius may be a good idea for improving the ranked list quality, but it isn’t the whole story for making the highest quality ranked list approximation.

Perhaps this has something to do with radius being measured as the worst outlier, while the rank distance is calculated as an average quality of all nodes.

Quick recap on rank distance: the rank distance is calculated by using the voting algorithm from the last post to make a ranked list, and then summing up the difference between each node’s placement in that estimated ranking list and the node’s placement in the actual ranking list. It’s the optimal transport cost to change the estimated ranking list into the actual ranking list. That value is divided by the number of nodes to try and normalize it.
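
As a sketch, if estimatedRank[node] and actualRank[node] gave each node's position in the two lists (the names are mine), the rank distance calculation would be:

#include <cstdlib>
#include <vector>

// Sum how far each node is from its true position in the ranked list,
// normalized by the number of nodes. This is the optimal transport cost of
// turning the estimated ranking into the actual ranking.
float RankDistance(const std::vector<int>& estimatedRank,
                   const std::vector<int>& actualRank)
{
    float sum = 0.0f;
    for (size_t node = 0; node < estimatedRank.size(); ++node)
        sum += float(std::abs(estimatedRank[node] - actualRank[node]));
    return sum / float(estimatedRank.size());
}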

Anyhow, thanks for reading and thanks to this video by CQFD for nerd sniping me with graph theory 🙂
https://www.youtube.com/watch?v=XSDBbCaO-kc