Shamir’s Quest: Collect Any 3 Keys To Unlock The Secret!

This post is on something called Shamir’s Secret Sharing. It’s a technique where you can break a secret number up into M different pieces, where if you have any N of those M pieces, you are able to figure out the secret.

Thinking of it in video game terms, imagine there are 10 keys hidden in a level, but you can escape the level whenever you find any 7 of them. This is what Shamir’s Secret Sharing enables you to set up cryptographically.

Interestingly in this case, the term sharing in “secret sharing” doesn’t mean sharing the secret with others. It means breaking the secret up into pieces, or SHARES. Secret sharing means that you make shares out of a secret, such that if you have enough of the shares, you can recover the secret.

How Do You Share (Split) The Secret?

The basic idea of how it works is actually really simple. This is good for us trying to learn the technique, but also good to show it’s security since there are so few moving parts.

It relies on something called the Unisolvence Theorem which is a fancy label meaning these things:

  • If you have a linear equation, it takes two (x,y) points to uniquely identify that line. No matter how you write a linear equation, if it passes through those same two points, it’s mathematically equivelant.
  • If you have a quadratic equation, it takes three (x,y) points to uniquely identify that quadratic curve. Again, no matter how you write a quadratic equation, if it passes through those same three points, it’s mathematically equivalent.
  • The pattern continues for equations of any degree. Cubic equations require four points to be uniquely identified, Quartic equations require five points, and so on.

At a high level, how this technique works is that the number of shares (keys) you want someone to collect (N ) defines the degree of an equation.

You use random numbers as the coefficients of the powers of x in that equation, but use your secret number as the constant term.

You then create M data points of the form (x,y) aka (x,f(x)) . Those are your shares. You then give individual shares to people, or go hide them in your dungeon or do whatever you are going to do with them.

As soon as any one person has N of those M shares (data points), they will be able to figure out the equation of the curve and thus get the secret.

The secret number is the constant term of the polynomial, which is also just f(0) .

This image below from wikipedia is great for seeing how you may have two points of a cubic curve, but without a third point you can’t be sure what the quadratic equation is. In fact, there are an infinite number of quadratic curves that pass through any two points! Because of that, it takes the full number of required shares for you to be able to unlock the secret.

Example: Sharing (Splitting) The Secret

First you decide how many shares you want it to take to unlock the secret. This determines the degree of your equation.

Let’s say you wanted a person to have to have four shares to unlock the secret. This means our equation will be a cubic equation, since it takes four points to uniquely define a cubic equation.

Our equation is:

f(x) = R_1x^3 + R_2x^2 + R_3x + S

Where the R_i values are random numbers, and S is the secret value.

Let’s say that our secret value is 435, and that we picked some random numbers for the equation, making the below:

f(x) = 28x^3 + 64x^2 + 9x + 435

We now have a function that is uniquely identifiable by any 4 points of data on it’s curve.

Next we decide how many pieces we are going to create total. We need at least 4 so that it is in fact solvable. Let’s make 6 shares.

To do this, you just plug in 6 different values of x and pair each x value with it’s y value. Let’s do that:

\begin{array}{c|c}  x & f(x) \\  \hline  1 & 536 \\  2 & 933 \\  3 & 1794 \\  4 & 3287 \\  5 & 5580 \\  6 & 8841 \\  \end{array}

When doing this part, remember that the secret number is f(0) , so make sure and not share what the value of the function is when x is 0!

You could then distribute the shares (data pairs) as you saw fit. Maybe some people are more important, so you give them more than one share, requiring a smaller amount of cooperation with them to unlock the secret.

Share distribution details are totally up to you, but we now have our shares, whereby if you have any of the 4 of the 6 total shares, you can unlock the secret.

How Do You Join The Secret?

Once you have the right number of shares and you know the degree of the polynomial (pre-shared “public” information), unlocking the secret is a pretty straightforward process too. To unlock the secret, you just need to use ANY method available for creating an equation of the correct degree from a set of data points.

This can be one of several different interpolation techniques, but the most common one to use seems to be Lagrange interpolation, which is something I previously wrote up that you can read about here: Lagrange Interpolation.

Once you have the equation, you can either evaluate f(0) , or you can write the equation in polynomial form and the constant term will be the secret value.

Example: Joining the Secret

Let’s say that we have these four shares and are ready to get the cubic function and then unlock the secret number:

\begin{array}{c|c}  x & y \\  \hline  1 & 536 \\  2 & 933 \\  4 & 3287 \\  6 & 8841 \\  \end{array}

We could bust out some Lagrange interpolation and figure this out, but let’s be lazy… err efficient I mean. Wolfram alpha can do this for us!

Wolfram Alpha: cubic fit (1, 536), (2, 933), (4, 3287), (6, 8841)

That gives us this equation, saying that it is a perfect fit (which it is!)
28x^3 + 64x^2 + 9x + 435

You can see that our constant term (and f(0) ) is the correct secret value of 435.

Daaaayummm Bru… that is lit AF! We just got hacked by wolfram alpha 😛

A Small Complication

Unfortunately, the above has a weakness. The weakness is that each share you get gives you a little bit more information about the secret value. You can read more about this in the links section at the end if you want to know more details.

Ideally, you wouldn’t have any information about the secret value until you had the full number of shares required to unlock the secret.

To address this problem, we are going to choose some prime number k and instead of shares being (x,y) data points on the curve, they are going to be (x,y \bmod k) . In technical terms we are going to be using points on a finite field, or a Galois field.

The value we choose for k needs to be larger than any of the coefficients of our terms (the random numbers) as well as larger than our secret value and larger than the number of shares we want to create. The larger the better besides that, because a larger k value means a larger “brute force” space to search.

If you want to use this technique in a situation which has real needs for security, please make sure and read more on this technique from more authoritative sources. I’m glossing over the details of security quite a bit, and just trying to give an intuitive understanding of this technique (:

Source Code

Below is some sample source code that implements Shamir’s Secret Sharing in C++.

I use 64 bit integers, but if you were going to be using this in a realistic situation you could very well overflow 64 bit ints and get the wrong answers. I hit this problem for instance when trying to require more than about 10 shares, using a prime of 257, and generating 50 shares. If you hit the limit of 64 bit ints you can use a multi precision math library instead to have virtually unlimited sized ints. The boost multiprecision header library is a decent choice for multi precision integers, specifically cpp_int.

#include <stdio.h>
#include <array>
#include <vector>
#include <math.h>
#include <random>
#include <assert.h>
#include <stdint.h>
#include <inttypes.h>

typedef int64_t TINT;
typedef std::array<TINT, 2> TShare;
typedef std::vector<TShare> TShares;

class CShamirSecretSharing
    CShamirSecretSharing (size_t sharesNeeded, TINT prime)
        : c_sharesNeeded(sharesNeeded), c_prime(prime)
        // There needs to be at least 1 share needed
        assert(sharesNeeded > 0);

    // Generate N shares for a secretNumber
    TShares GenerateShares (TINT secretNumber, TINT numShares) const
        // calculate our curve coefficients
        std::vector<TINT> coefficients;
            // store the secret number as the first coefficient;
            coefficients[0] = secretNumber;

            // randomize the rest of the coefficients
            std::array<int, std::mt19937::state_size> seed_data;
            std::random_device r;
            std::generate_n(, seed_data.size(), std::ref(r));
            std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
            std::mt19937 gen(seq);
            std::uniform_int_distribution<TINT> dis(1, c_prime - 1);
            for (TINT i = 1; i < c_sharesNeeded; ++i)
                coefficients[(size_t)i] = dis(gen);

        // generate the shares
        TShares shares;
        for (size_t i = 0; i < numShares; ++i)
            shares[i] = GenerateShare(i + 1, coefficients);
        return shares;

    // use lagrange polynomials to find f(0) of the curve, which is the secret number
    TINT JoinShares (const TShares& shares) const
        // make sure there is at elast the minimum number of shares
        assert(shares.size() >= size_t(c_sharesNeeded));

        // Sigma summation loop
        TINT sum = 0;
        for (TINT j = 0; j < c_sharesNeeded; ++j)
            TINT y_j = shares[(size_t)j][1];

            TINT numerator = 1;
            TINT denominator = 1;

            // Pi product loop
            for (TINT m = 0; m < c_sharesNeeded; ++m)
                if (m == j)

                numerator = (numerator * shares[(size_t)m][0]) % c_prime;
                denominator = (denominator * (shares[(size_t)m][0] - shares[(size_t)j][0])) % c_prime;

            sum = (c_prime + sum + y_j * numerator * modInverse(denominator, c_prime)) % c_prime;
        return sum;

    const TINT GetPrime () const { return c_prime; }
    const TINT GetSharesNeeded () const { return c_sharesNeeded; }


    // Generate a single share in the form of (x, f(x))
    TShare GenerateShare (TINT x, const std::vector<TINT>& coefficients) const
        TINT xpow = x;
        TINT y = coefficients[0];
        for (TINT i = 1; i < c_sharesNeeded; ++i) {
            y += coefficients[(size_t)i] * xpow;
            xpow *= x;
        return{ x, y % c_prime };

    // Gives the decomposition of the gcd of a and b.  Returns [x,y,z] such that x = gcd(a,b) and y*a + z*b = x
    static const std::array<TINT, 3> gcdD (TINT a, TINT b) {
        if (b == 0)
            return{ a, 1, 0 };

        const TINT n = a / b;
        const TINT c = a % b;
        const std::array<TINT, 3> r = gcdD(b, c);

        return{ r[0], r[2], r[1] - r[2] * n };

    // Gives the multiplicative inverse of k mod prime.  In other words (k * modInverse(k)) % prime = 1 for all prime > k >= 1 
    static TINT modInverse (TINT k, TINT prime) {
        k = k % prime;
        TINT r = (k < 0) ? -gcdD(prime, -k)[2] : gcdD(prime, k)[2];
        return (prime + r) % prime;

    // Publically known information
    const TINT          c_prime;
    const TINT          c_sharesNeeded;

void WaitForEnter ()
    printf("Press Enter to quit");

int main (int argc, char **argv)
    // Parameters
    const TINT c_secretNumber = 435;
    const TINT c_sharesNeeded = 7;
    const TINT c_sharesMade = 50;
    const TINT c_prime = 439;   // must be a prime number larger than the other three numbers above

    // set up a secret sharing object with the public information
    CShamirSecretSharing secretSharer(c_sharesNeeded, c_prime);

    // split a secret value into multiple shares
    TShares shares = secretSharer.GenerateShares(c_secretNumber, c_sharesMade);

    // shuffle the shares, so it's random which ones are used to join
    std::array<int, std::mt19937::state_size> seed_data;
    std::random_device r;
    std::generate_n(, seed_data.size(), std::ref(r));
    std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
    std::mt19937 gen(seq);
    std::shuffle(shares.begin(), shares.end(), gen);

    // join the shares
    TINT joinedSecret = secretSharer.JoinShares(shares);

    // show the public information and the secrets being joined
    printf("%" PRId64 " shares needed, %i shares made\n", secretSharer.GetSharesNeeded(), shares.size());
    printf("Prime = %" PRId64 "\n\n", secretSharer.GetPrime());
    for (TINT i = 0, c = secretSharer.GetSharesNeeded(); i < c; ++i)
        printf("Share %" PRId64 " = (%" PRId64 ", %" PRId64 ")\n", i+1, shares[i][0], shares[i][1]);

    // show the result
    printf("\nJoined Secret = %" PRId64 "\nActual Secret = %" PRId64 "\n\n", joinedSecret, c_secretNumber);
    assert(joinedSecret == c_secretNumber);
    return 0;

Example Output

Here is some example output of the program:


Wikipedia: Shamir’s Secret Sharing (Note: for some reason the example javascript implementation here only worked for odd numbered keys required)
Wikipedia: Finite Field Shamir’s Secret Sharing
Java Implementation of Shamir’s Secret Sharing (Note: I don’t think this implementation is correct, and neither is the one that someone posted to correct them!)

When writing this post I wondered if maybe you could use the coefficients of the other terms as secrets as well. These two links talk about the details of that:
Cryptography Stack Exchange: Why only one secret value with Shamir’s secret sharing?
Cryptography Stack Exchange: Coefficients in Shamir’s Secret Sharing Scheme

Now that you understand this, you are probably ready to start reading up on elliptic curve cryptography. Give this link below a read if you are interested in a gentle introduction on that!
A (Relatively Easy To Understand) Primer on Elliptic Curve Cryptography



Turning a Truth Table Into A digital Circuit (ANF)

In this post I’m going to show how you turn a truth table into a digital logic circuit that uses XOR and AND gates.

My Usage Case

My specific usage case for this is in my investigations into homomorphic encryption, which as you may recall is able to perform computation on encrypted data. This lets encrypted data be operated on by an untrusted source, given back to you, and then you can decrypt your data to get a result.

Lots of use cases if this can ever get fast enough to become practical, such as doing cloud computing with private data. However, when doing homomorphic encryption (at least currently, for the techniques I’m using), you only have XOR and AND logic operations.

So, I’m using the information in this post to be able to turn a lookup table, or a specific boolean function, into a logic circuit that I can feed into a homomorphic encryption based digital circuit.

Essentially I want to figure out how to do a homomorphic table lookup to try and make some simple as possible circuits, that will in turn be as fast and lean as possible.

If you want to know more about homomorphic encryption, here’s a post I wrote which explains a very simple algorithm: Super Simple Symmetric Leveled Homomorphic Encryption Implementation

Algebraic Normal Form

Algebraic normal form (ANF) is a way of writing a boolean function using only XOR and AND.

Since it’s a normal form, two functions that do the same thing will be the same thing in ANF.

There are other forms for writing boolean logic, but ANF suits me best for my homomorphic encryption circuit needs!

An example of boolean logic in ANF is the below:

f(x_1, x_2, x_3, x_4) = x_1 x_2 \oplus x_1 x_3 \oplus x_1 x_4

It is essentially a boolean polynomial, where AND is like multiplication, and XOR is like addition. It even factors the same way. In fact, ANF is not always the smallest circuit possible, you’d have to factor common ANDs to find the smallest way you could represent the circuit, like the below:

f(x_1, x_2, x_3, x_4) = x_1 (x_2 \oplus x_3 \oplus x_4)

That smaller form does 1 AND and 2 XORs, versus the ANF which does 3 ANDs and 2 XORs. In homomorphic encryption, since AND is so much more costly than XOR, minimizing the ANDs is a very nice win, and worth the effort.

Wikipedia has some more info about ANF here: Wikipedia: Algebraic normal form

Truth Tables and Lookup Tables

A truth table is just where you specify the inputs into a boolean function and the output of that boolean function for the given input:

\begin{array}{c|c|c|c}  x_1 & x_2 & x_3 & f(x_1, x_2, x_3) \\  \hline  0 & 0 & 0 & 0 \\  0 & 0 & 1 & 1 \\  0 & 1 & 0 & 1 \\  0 & 1 & 1 & 0 \\  1 & 0 & 0 & 1 \\  1 & 0 & 1 & 0 \\  1 & 1 & 0 & 0 \\  1 & 1 & 1 & 1 \\  \end{array}

A lookup table is similar in functionality, except that it has multi bit output. When dealing with digital circuits, you can make a lookup table by making a truth table per output bit. For instance, the above truth table might just be the low bit of the lookup table below, which is just a truth table for addition of the input bits.

\begin{array}{c|c|c|c}  x_1 & x_2 & x_3 & f(x_1, x_2, x_3) \\  \hline  0 & 0 & 0 & 00 \\  0 & 0 & 1 & 01 \\  0 & 1 & 0 & 01 \\  0 & 1 & 1 & 10 \\  1 & 0 & 0 & 01 \\  1 & 0 & 1 & 10 \\  1 & 1 & 0 & 10 \\  1 & 1 & 1 & 11 \\  \end{array}

Converting Truth Table to ANF

When I first saw the explanation for converting a truth table to ANF, it looked pretty complicated, but luckily it turns out to be pretty easy.

The basic idea is that you make a term for each possible combination of x inputs, ANDing a term by each constant, and then solving for those constants.

Let’s use the truth table from the last section:

\begin{array}{c|c|c|c}  x_1 & x_2 & x_3 & f(x_1, x_2, x_3) \\  \hline  0 & 0 & 0 & 0 \\  0 & 0 & 1 & 1 \\  0 & 1 & 0 & 1 \\  0 & 1 & 1 & 0 \\  1 & 0 & 0 & 1 \\  1 & 0 & 1 & 0 \\  1 & 1 & 0 & 0 \\  1 & 1 & 1 & 1 \\  \end{array}

For three inputs, the starting equation looks like this:

f(x_1, x_2, x_3) = \\  a_0 \\  \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3 \\  \oplus a_{12} x_1 x_2 \oplus a_{13} x_1 x_3 \oplus a_{23} x_2 x_3 \\  \oplus a_{123} x_1 x_2 x_3

Now we have to solve for the a values.

To solve for a_{123} , we just look in the truth table for function f(x_1, x_2, x_3) to see if we have an odd or even number of ones in the output of the function. If there is an even number, it is 0, else it is a 1.

Since we have an even number of ones, the value is 0, so our equation becomes this:

f(x_1, x_2, x_3) = \\  a_0 \\  \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3 \\  \oplus a_{12} x_1 x_2 \oplus a_{13} x_1 x_3 \oplus a_{23} x_2 x_3 \\  \oplus 0 \land x_1 x_2 x_3

Note that \land is the symbol for AND. I’m showing it explicitly because otherwise the equation looks weird, and a multiplication symbol isn’t correct.

Since 0 ANDed with anything else is 0, and also since n XOR 0 = n, that whole last term disappears, leaving us with this equation:

f(x_1, x_2, x_3) = \\  a_0 \\  \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3 \\  \oplus a_{12} x_1 x_2 \oplus a_{13} x_1 x_3 \oplus a_{23} x_2 x_3

Next up, to solve for a_{12} , we need to limit our truth table to f(x_1, x_2, 0) . That truth table is below, made from the original truth table, but throwing out any row where x_{3} is 1.

\begin{array}{c|c|c|c}  x_1 & x_2 & x_3 & f(x_1, x_2, 0) \\  \hline  0 & 0 & 0 & 0 \\  0 & 1 & 0 & 1 \\  1 & 0 & 0 & 1 \\  1 & 1 & 0 & 0 \\  \end{array}

We again just look at whether there are an odd or even number of ones in the function output, and use that to set a_{12} appropriately. In this case, there are an even number, so we set it to 0, which makes that term disappear again. Our function is now down to this:

f(x_1, x_2, x_3) = \\  a_0 \\  \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3 \\  \oplus a_{13} x_1 x_3 \oplus a_{23} x_2 x_3

If we look at f(x_1,0,x_3) , we find that it also has an even number of ones, making a_{13} become 0 and making that term disappear.

Looking at f(0,x_2,x_3) , it also has an even number of ones, making a_{23} become 0 and making that term disappear as well.

That leaves us with this equation:

f(x_1, x_2, x_3) = \\  a_0 \\  \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3

To solve for a_1 , we look at the truth table for f(x_1,0,0) , which is below:

\begin{array}{c|c|c|c}  x_1 & x_2 & x_3 & f(x_1, 0, 0) \\  \hline  0 & 0 & 0 & 0 \\  1 & 0 & 0 & 1 \\  \end{array}

There are an odd number of ones in the output, so a_1 becomes 1. Finally, we get to keep a term! The equation is below:

f(x_1, x_2, x_3) = \\  a_0 \\  \oplus 1 \land x_1 \oplus a_2 x_2 \oplus a_3 x_3

Since 1 AND n = n, we can drop the explicit 1 to become this:

f(x_1, x_2, x_3) = \\  a_0 \\  \oplus x_1 \oplus a_2 x_2 \oplus a_3 x_3

If you do the same process for a_2 and a_3 , you’ll find that they also have odd numbers of ones in the output so also become ones. That puts our equation at:

f(x_1, x_2, x_3) = \\  a_0 \\  \oplus x_1 \oplus x_2 \oplus x_3

Solving for a_0 , is just looking at whether there are an odd or even number of ones in the function f(0,0,0) which you can look up directly in the lookup table. It’s even, so a_0 becomes 0, which makes our full final equation into this:

f(x_1, x_2, x_3) = x_1 \oplus x_2 \oplus x_3

We are done! This truth table can be implemented with 3 XORs and 0 ANDs. A pretty efficient operation!

You can see this is true if you work it out with the truth table. Try it out and see!

\begin{array}{c|c|c|c}  x_1 & x_2 & x_3 & f(x_1, x_2, x_3) \\  \hline  0 & 0 & 0 & 0 \\  0 & 0 & 1 & 1 \\  0 & 1 & 0 & 1 \\  0 & 1 & 1 & 0 \\  1 & 0 & 0 & 1 \\  1 & 0 & 1 & 0 \\  1 & 1 & 0 & 0 \\  1 & 1 & 1 & 1 \\  \end{array}

Sample Code

Here is some sample code that lets you define a lookup table by implementing an integer function, and it generates the ANF for each output bit for the truth table. It also verifies that the ANF gives the correct answer. It shows you how to use this to make various circuits: bit count, addition, multiplication, division and modulus.

#include <stdio.h>
#include <array>
#include <vector>

#define PRINT_NUMOPS() 1
#define PRINT_ANF() 1

void WaitForEnter ()
    printf("Press Enter to quit");

template <size_t NUM_INPUT_BITS>
bool LookupEntryPassesMask (size_t entry, size_t mask)
    for (size_t i = 0; i < NUM_INPUT_BITS; ++i)
        const size_t bitMask = 1 << i;
        const bool allowOnes = (mask & bitMask) != 0;
        const bool bitPassesMask = allowOnes || (entry & bitMask) == 0;
        if (!bitPassesMask)
            return false;
    return true;

template <size_t NUM_INPUT_BITS>
bool ANFHasTerm (const std::array<size_t, 1 << NUM_INPUT_BITS> &lookupTable, size_t outputBitIndex, size_t termMask)
    const size_t c_inputValueCount = 1 << NUM_INPUT_BITS;

    int onesCount = 0;
    for (size_t i = 0; i < c_inputValueCount; ++i)
        if (LookupEntryPassesMask<NUM_INPUT_BITS>(i, termMask) && ((lookupTable[i] >> outputBitIndex) & 1) != 0)

    return (onesCount & 1) != 0;

template <size_t NUM_INPUT_BITS>
void MakeANFTruthTable (const std::array<size_t, 1 << NUM_INPUT_BITS> &lookupTable, std::array<size_t, 1 << NUM_INPUT_BITS> &reconstructedLookupTable, size_t outputBitIndex)
    const size_t c_inputValueCount = 1 << NUM_INPUT_BITS;
    printf("-----Output Bit %u-----\r\n", outputBitIndex);

    // print truth table if we should
        for (size_t inputValue = 0; inputValue < c_inputValueCount; ++inputValue)
            printf("  [%u] = %u\r\n", inputValue, ((lookupTable[inputValue] >> outputBitIndex) & 1) ? 1 : 0);

    // find each ANF term
    std::vector<size_t> terms;
    for (size_t termMask = 0; termMask < c_inputValueCount; ++termMask)
        if (ANFHasTerm<NUM_INPUT_BITS>(lookupTable, outputBitIndex, termMask))

    // print function params
    #if PRINT_ANF()
        for (size_t i = 0; i < NUM_INPUT_BITS; ++i)
            if (i > 0)
        printf(") = \r\n");

    // print ANF and count XORs and ANDs
    size_t numXor = 0;
    size_t numAnd = 0;
    if (terms.size() == 0)
        #if PRINT_ANF()
        for (size_t termIndex = 0, termCount = terms.size(); termIndex < termCount; ++termIndex)
            if (termIndex > 0) {
                #if PRINT_ANF()
                printf("XOR ");

            size_t term = terms[termIndex];
            if (term == 0)
                #if PRINT_ANF()
                bool firstProduct = true;
                for (size_t bitIndex = 0; bitIndex < NUM_INPUT_BITS; ++bitIndex)
                    const size_t bitMask = 1 << bitIndex;
                    if ((term & bitMask) != 0)
                        #if PRINT_ANF()
                        printf("x%i ", bitIndex + 1);
                        if (firstProduct)
                            firstProduct = false;
            #if PRINT_ANF()
    #if PRINT_ANF()

    #if PRINT_NUMOPS()
    printf("%u XORs, %u ANDs\r\n\r\n", numXor, numAnd);

    // reconstruct a bit of the reconstructedLookupTable for each entry to be able to verify correctness
    const size_t c_outputBitMask = 1 << outputBitIndex;
    for (size_t valueIndex = 0; valueIndex < c_inputValueCount; ++valueIndex)
        bool xorSum = false;
        for (size_t termIndex = 0, termCount = terms.size(); termIndex < termCount; ++termIndex)
            size_t term = terms[termIndex];
            if (term == 0)
                xorSum = 1 ^ xorSum;
                bool andProduct = true;
                for (size_t bitIndex = 0; bitIndex < NUM_INPUT_BITS; ++bitIndex)
                    const size_t bitMask = 1 << bitIndex;
                    if ((term & bitMask) != 0)
                        if ((valueIndex & bitMask) == 0)
                            andProduct = false;
                xorSum = andProduct ^ xorSum;
        if (xorSum)
            reconstructedLookupTable[valueIndex] |= c_outputBitMask;

template <size_t NUM_INPUT_BITS, size_t NUM_OUTPUT_BITS, typename LAMBDA>
void MakeANFLookupTable (const LAMBDA& lambda)
    // make lookup table
    const size_t c_outputValueMask = (1 << NUM_OUTPUT_BITS) - 1;
    const size_t c_inputValueCount = 1 << NUM_INPUT_BITS;
    std::array<size_t, c_inputValueCount> lookupTable;
    for (size_t inputValue = 0; inputValue < c_inputValueCount; ++inputValue)
        lookupTable[inputValue] = lambda(inputValue, NUM_INPUT_BITS, NUM_OUTPUT_BITS) & c_outputValueMask;

    // make the anf for each truth table (each output bit of the lookup table)
    std::array<size_t, c_inputValueCount> reconstructedLookupTable;
    std::fill(reconstructedLookupTable.begin(), reconstructedLookupTable.end(), 0);
    for (size_t outputBitIndex = 0; outputBitIndex < NUM_OUTPUT_BITS; ++outputBitIndex)
        MakeANFTruthTable<NUM_INPUT_BITS>(lookupTable, reconstructedLookupTable, outputBitIndex);

    // verify that our anf expressions perfectly re-create the lookup table
    for (size_t inputValue = 0; inputValue < c_inputValueCount; ++inputValue)
        if (lookupTable[inputValue] != reconstructedLookupTable[inputValue])
            printf("ERROR: expression / lookup mismatch for index %u\r\n", inputValue);
    printf("expression / lookup verification complete.\r\n\r\n");

size_t CountBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
    // Count how many bits there are
    int result = 0;
    while (inputValue)
        if (inputValue & 1)
        inputValue = inputValue >> 1;
    return result;

size_t AddBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
    // break the input bits in half and add them
    const size_t bitsA = numInputBits / 2;
    const size_t mask = (1 << bitsA) - 1;

    size_t a = inputValue & mask;
    size_t b = inputValue >> bitsA;
    return a+b;

size_t MultiplyBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
    // break the input bits in half and add them
    const size_t bitsA = numInputBits / 2;
    const size_t mask = (1 << bitsA) - 1;

    size_t a = inputValue & mask;
    size_t b = inputValue >> bitsA;

    return a * b;

size_t DivideBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
    // break the input bits in half and add them
    const size_t bitsA = numInputBits / 2;
    const size_t mask = (1 << bitsA) - 1;

    size_t a = inputValue & mask;
    size_t b = inputValue >> bitsA;

    // workaround for divide by zero
    if (b == 0)
        return 0;

    return a / b;

size_t ModulusBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
    // break the input bits in half and add them
    const size_t bitsA = numInputBits / 2;
    const size_t mask = (1 << bitsA) - 1;

    size_t a = inputValue & mask;
    size_t b = inputValue >> bitsA;

    // workaround for divide by zero
    if (b == 0)
        return 0;

    return a % b;

int main (int argc, char **argv)
    //MakeANFLookupTable<3, 2>(CountBits);    // Output bits needs to be enough to store the number "input bits"
    //MakeANFLookupTable<4, 3>(AddBits);      // Output bits needs to be (InputBits / 2)+1
    //MakeANFLookupTable<4, 4>(MultiplyBits); // Output bits needs to be same as input bits
    //MakeANFLookupTable<4, 2>(DivideBits);   // Output bits needs to be half of input bits (rounded down)
    //MakeANFLookupTable<4, 2>(ModulusBits);  // Output bits needs to be half of input bits (rounded down)
    //MakeANFLookupTable<10, 5>(DivideBits);  // 5 bit vs 5 bit division is amazingly complex!
    MakeANFLookupTable<4, 2>(ModulusBits);  // Output bits needs to be half of input bits (rounded down)
    return 0;

Sample Code Runs

Here is the program output for a “bit count” circuit. It counts the number of bits that are 1, in the 3 bit input, and outputs the answer as 2 bit output. Note that the bit 0 output is the same functionality as the example we worked through by hand, and you can see that it comes up with the same answer.

Here is the program output for an adder circuit. It adds two 2 bit numbers, and outputs a 3 bit output.

Here is the program output for a multiplication circuit. It multiplies two 2 bit numbers, and outputs a 4 bit number.

Here is the program output for a division circuit. It divides a 2 bit number by another 2 bit number and outputs a 2 bit number. When higher bit counts are involved, the division circuit gets super complicated, it’s really crazy! 5 bit divided by 5 bit is several pages of output for instance. Note that it returns 0 whenever it would divide by 0.

Lastly, here is the program output for a modulus circuit. It divides a 2 bit number by another 2 bit number and outputs the remainder as a 2 bit number.

Closing and Links

While the above shows you how to turn a single bit truth table into ANF, extending this to a multi bit lookup table is super simple; you just do the same process for each output bit in the lookup table.

Here are a few links in case anything above is unclear, or you want more information.

Finding Boolean/Logical Expressions for truth tables in algebraic normal form(ANF)

Finding Boolean/Logical Expressions for truth tables



Game Development Needs Data Pipeline Middleware

In 15 years I’ve worked at 7 different game studios, ranging from small to large, working on many different kinds of projects in a variety of roles.

At almost every studio, there was some way for the game to load data at runtime that controlled how it behaved – such as the damage a weapon would do or the cost of an item upgrade.

The studios that didn’t have this setup could definitely have benefited from having it. After all, this is how game designers do their job!

Sometimes though, this data was maintained via excel spreadsheets (export as csv for instance and have the game read that). That is nearly the worst case scenario for data management. Better though is to have an editor which can edit that data, preferably able to edit data described by schemas, which the game also uses to generate code to load that data.

Each studio I’ve worked at that did have game data each had their own solution for their data pipeline, and while they are all of varying qualities, I have yet to see something that is both fast and has most of the features you’d reasonably want or expect.

We really need some middleware to tackle this “solved problem” and offer it to us at a reasonable price so we can stop dealing with it. Open sourced would be fine too. Everyone from engineers to production to content people will be much happier and more productive!

Required Features

Here are the features I believe are required to satisfy most folks:

  1. Be able to define the structure of your data in some format (define data schema).
  2. Have an editor that is able to quickly launch, quickly load up data in the data schema and allow a nice interface to editing the data as well as searching the data.
  3. This edited data should be in some format that merges well (for dealing with branching), and preferably is standardized so you can use common tools on the data – such as XSLT if storing data as xml. XML isn’t commonly very mergable so not sure the solution there other than perhaps a custom merge utility perhaps?
  4. The “data solution” / project file should store your preferences about how you want the final data format to be: xml, json, binary, other? Checkboxes for compression and encryption, etc. Switching the data format should take moments.
  5. There should be a cooking process that can be run from the data editor or via command line which transforms the edited data into whatever format the destination data should be in. AKA turn the human friendly XML into machine friendly binary files which you load in with a single read and then do pointer fixup on.
  6. This pipeline should generate the code that loads and interacts with the data as described in the data schema. For instance you say “load my data” and it does all the decompression, decryption, parsing, etc giving you back a root data structure which contains compile time defined strongly typed structures. This is important because when you change the format of the data that the game uses, no game code actually has to know or care. Whatever it takes to load your data happens when you call the function.

Bonus Points

Here are some bonus point features that would be great to have:

  1. Handle live editing of data. When the game and editor is both open, and data is edited, have it change the data on the game side in real time, and perhaps allow a callback to be intercepted in case the game needs to clear out any cached values or anything. This helps iteration time by letting people make data changes without having to relaunch the game. Also needs to be able to connect to a game over tcp/ip and handle endian correction as needed as well as 32 vs 64 bit processes using the same data.
  2. Handle the usual problems associated with DLC and versioning in an intelligent way. Many data systems that support DLC / Patching / Schema Updates post ship have strange rules about what data you can and can’t change. Often times if you get it wrong, you make a bug that isn’t always obvious. If support for this was built in, and people didnt have to concern themselves with it, it’d be great.
  3. On some development environments, data must be both forwards and backwards compatible. Handling that under the covers in an intelligent way would be awesome.
  4. The editor should be extensible with custom types and plugins for visualizations of data, as well as interactive editing of data. This same code path could be used to integrate parts of the game engine with the editor for instance (slippery slope to making the editor slow, however).
  5. Being able to craft custom curves, and being able to query them simply and efficiently from the game side at runtime would be awesome.
  6. Support “cook time computations”. The data the user works with isn’t always set up the way that would be best for the machine. It’d be great to be able to do custom calculations and computations at runtime. Also great for building acceleration data structures.
  7. You should be able to run queries against the data or custom scripts. To answer questions like “Is anyone using this feature?” and “I need to export data from our game into a format that this other program can read”
  8. Being able to export data as c++ literal data structures, for people who want to embed (at least some of) their data in the exe to reduce complexity, loading times, etc.

It should also be as fast and lightweight as possible. It should allow games to specify memory and file i/o overrides.

Localized text is also a “solved problem” that needs an available solution. It could perhaps be rolled into this, or maybe it would make most sense for it to be separate.

As another example of how having something like this would be useful, on multiple occasions at previous studios, people have suggested we change the format of the data that the game uses at runtime. For instance, from json to a binary format. In each case this has come up so far, people have said it would take too long and it got backlogged (ie killed). With data pipeline middleware that works as i describe it, you would click a few checkboxes, recook your data and test it to have your runtime results. That’s as hard as it SHOULD be, but in practice it’s much harder because everyone rolls their own and the cobbler never has time to fix his shoes (;

Anyone out there want to make this happen? (:



Using Wang Tiles to Simulate Turing Machines

Wang tiles were invented by Hao Wang in 1961 for mathematical reasons, but they find great use in games for making tile based art which gives results that don’t look tiled – both with 2d tiled textures, as well as 3d tiled models.

Apparently Wang tiles are also able to execute Turing machines, and so are thus Turing complete – meaning they can execute any program.

That is a pretty amazing and perplexing statement, so this post explores that a bit.

Wang Tiles Overview

Wang tiles are rectangular tiles where each edge will only fit with other specific edges, but that for any specific edge, there is more than one possible tile that can fit with that edge. By fit with that edge, I mean they are seamless when put together, without any visual artifacts to hint at there actually being a seam between the tiles.

This is useful for graphics because this lets you have seamless tiled graphics, but the specific configuration of how the tiles are placed can be completely randomized, so long as their edges are all compatible. The result is tiled graphics that doesn’t look at all tiled, due to visible patterns being much less noticeable than with traditional tiled graphics.

For graphical examples, a little more info and some links to some shadertoys check this out: Wang Tiling

Here is an example I made. The graphics are programmer art but hopefully you get the idea. This was made with 16 tiles, where there were two different edge types per edge.

Turing Machine Overview

Turing machines were invented in 1936 by Alan Turing as a generic computing machine that was proven to be able to execute any algorithm.

The turing machine is made up of a few main components: the memory tape, the read/write head, and the state machine.

The memory tape is infinitely long, so has infinite storage, and is initialized to all zeroes to start out.

The read/write head starts at a position on the tape, and can read or write values, and also move left or right on the tape.

The state machine controls the read/write head.

The state machine knows what state it is in and has rules about what to do in each state when it reads a value from the tape.

For instance, in state A, if a 0 is read from the tape, the rule may be to write a 1 to the current position on the tape, move the read/write head to the right, and go to state B. State B may have completely different logic, and could either transition back to state A, state in state B, or move to another state entirely.

Using simple state transition logic like that, any computer algorithm can be performed.

In a Turing machine there can also be a “Halt State” which means the program is finished executing and the answer it was trying to calculate has been calculated.

Looking at some programs, you can easily see that they will halt eventually, or that they will be an infinite loop and never halt. Some programs in-between are complex and can’t very easily be determined if they will ever halt or not. Turing proved that there is no general solution to whether a Turing machine (aka any computer program) will halt or not, and this is called the Halting Problem. In general, the only way to know if a program will halt or not is to wait and see. So, effectively the answers to whether a program in general will halt or not are “yes” and “not yet” – although for many specific programs you can in fact see that they will halt eventually if you were to run them.

Wang Tile Computation

It turns out that Wang tiles can simulate a Turing machine, and so are “Turing complete” meaning that they too can perform any computer algorithm.

To make this happen, we’ll make a column of tiles that represent the state of the Turing machine at a specific point in time, starting with time 0 at the left most column. We’ll place tiles in the column to the right making sure all edge rules are respected, and then do the column to the right of that one etc until the program halts (or forever if it doesn’t halt). If we set up our set of tiles correctly, the act of satisfying the edge rules as we place our tiles is enough to execute the Turing machine.

Let’s walk through a simple example where we have the following state machine logic rules:

  1. When in state A, if a 0 is read, we will write a 1, move the read/write head down and move to state B.
  2. When in state A, if a 1 is read, we will halt (enter the halt state).
  3. When in state B, if a 0 is read, we will write a 1, move the read/write head up and move to state A.
  4. When in state B, if a 1 is read, we will halt (enter the halt state).

Tape Memory Storage

The first thing we need is persistent storage of memory for the tape. For this, we’ll need the following two tiles:

To see this working, we can set up a section of tape with some values (make a column of wang tiles), and we can see that the only valid wang tiles to place next to the starting column are tiles which propogate the 0 and the 1 values forward in time without modifying them.

In the diagram below, we initialize the tape to 0101 in the left most column (time 0). By only placing down tiles with compatible edges you can see that our memory values persist forever. Our memory storage is implemented, huzzah!

We’ll start our example with all memory initialized to 0, but the above shows that we can have persistent memory.

Read/Write Head State Machine

The read/write head of the Turing machine is represented as part of the edge information. In this way, besides an edge storing the 0 or 1, if that is where the read/write head is, it also stores the state of the state machine.

Our example uses two states (besides the halt state): A and B. If a 1 is read in while being in either state A or B, the program halts.

To handle that, we need the tiles below:

Now that we have the rules for entering the halt state done (rule #2 and rule #4), we have to figure out how to implement the rules that control switching from one state to another (rule #1 and rule #3).

Moving the Read/Write Head

Rule #1 says that if we are in state A and read a 0, we should write a 1, move the read/write head down and move to state B.

We’ll need this tile to cause reading a 0 in state A to write a 1 as output, and to tell the tile below to move to state B.

The tile below that one could either be a 0 or a 1, and without knowing which, we want it to keep it’s value but accept the read/write head and be in state B. To do that we need two tiles, one for if there is a 0 on the tape at that position, and another for if there is a 1 on the tape.

Rule #3 says that if we are in state B and read a 0, we should write a 1, move the read/write head up and move to state A.

To do that, we’ll need a similar construction as for rule #1 but we are moving up instead of down. These 3 tiles will give us what we need:

Starting Column Tiles

We are going to treat the boundaries of our simulation area as if they have edges of “x”.

This means that to make our starting column (the Turing machine at time 0), we are going to need 2 special tiles. One tile will be for storing a 0 on the tape, which is what the tape is initialized to, and the other tile will be for storing the position of the read/write head in state A, which is our starting state.

Here are those two tiles:

Final Tileset

Here’s the full set of 12 tiles that we are going to use:

Full Simulation

Here is the initial setup at time 0 for our Turing machine. Note that this is one possible starting state, but this is the starting state we are choosing. We are not leaving it up to chance where the read/write head starts, or if it is even present at all. If we only followed edge rules we may get 4 read/write heads or 0, or anything in between.

From here, to build the second column, we start from the top and work towards the bottom, choosing the tile that fits the constraints of the edge it touches. In this first step, the head reads a 0, writes a 1, moves down, and moves to state B.

Heres is the second step, where the read reads a 0, writes a 1, moves up, and moves to state A.

Here is the final step, where the head reads a 1 and enters the halt state, signifying that the program has terminated.

The program halted, and gave an output value of “0110” or 6. This output isn’t really meaningful but other programs can give output that is meaningful. For instance you could have a Turing machine add two numbers, and the output would be the sum of those two numbers.

An Important Detail

There is an important detail that the above doesn’t address, and that many explanations of Wang tile Turing machines don’t seem to talk about.

When placing the second tile for time 2, the only constraint from the edges is that the tile must have an x on top and a 1 on the left. This actually makes it ambiguous which tile should be chosen between the two tiles below.

How do we choose the right one then?

The answer is that you make a guess and just choose one. If the wrong one was chosen in this case, when we moved to the next tile, we’d be looking for a tile which had an x on top and a B0 on the left. There is no such tile so you’d be unable to place a tile. When this happened, you’d take a step back to the last tile, and try one of the other possibilities.

So, unfortunately there is some literal trial and error involved when simulating Turing machines with Wang tiles, but it is fairly manageable at least. It definitely makes it a bit more complex to calculate in a pixel shader if you were so inclined (or other massively parallel processing units), but it shouldn’t be that much more costly.

Closing & Links

Some of the links below talk about Wang tiles and Turing machines, but don’t seem to strictly be Turing machines anymore. For instance, you might notice that some examples allow data to travel “back in time” where after the program halts, the answer is in the tape at time 0 of the Turing machine, even though that data wasn’t actually there at time 0. This shows that Wang tiles can do computation in their own right, beyond simulating Turing machines, but I’m not really sure what that technique itself would be called.

Also if you are wondering if this is useful to do computation with Wang tiles, I’m not really sure of any practical usage cases myself. However, apparently scientists have found that DNA can act much like Wang tiles act, where they will fit together only if edges are compatible. Because of this, there is ongoing research into DNA based computation that is based on the work of Wang tile computation. pretty interesting stuff!

Here is a shadertoy implementation of wang tiles computing prime numbers in a webgl pixel shader:
Shadertoy: WangTiles : PrimeGenerator

Here are some great videos on Turing machines and the halting problem:
Turing Machines Explained – Computerphile
Turing & The Halting Problem – Computerphile

Here are some other links:
Computing With Tiles
Wikipedia: Wang Tile
Wang Tiles and Turing Machines
Wang Tiles – 1

Here are some academic papers:
Computing With Tiles
Computability of Tilings



A Sixth Way To Calculate Sine Without Trig

I have another item to add to the pile of ways to calculate sine without trig. Here are the previous ways before we start:

Four Ways to Calculate Sine Without Trig

A Fifth Way to Calculate Sine Without Trig

This method is called Bhaskara I’s sine approximation formula and it’s just a numerical way of approximating sine.

The below is some glsl code from @paniq that has been adapted to take 0 to 1 as input, which corresponds to 0 to 2*pi radians, or 0 to 360 degrees, and returns the normalized vector of that angle. The x component of the vector is the cosine of the angle and the y component of the vector is the sine of the angle. Useful for packing 2d normals into a color channel (;

// x is 0..1 corresponding to 0..360 degrees
vec2 CosSin(float x) {
    vec2 si = fract(vec2(0.5,1.0) - x*2.0)*2.0 - 1.0;
   	vec2 so = sign(0.5-fract(vec2(0.25,0.5) - x));    
    return (20.0 / (si*si + 4.0) - 4.0) * so;

Here’s a shadertoy to compare/contrast this technique versus reality (or, reality as per the video card). Spoiler alert – the shadertoy is ridiculous, they are basically the same.
Shadertoy: Bhaskara Cos Sin Approximation

More from paniq:

i also wrote an approximate atan to go with it Shadertoy: Pseudo-Polar Mapping see the ALTMETHOD branch. also changed the sin/cos computation to ensure the sin/cos vector is perfectly normalized.



Matrix Form of Bezier Curves

Bezier curves are most often talked about either in terms of the De Casteljau algorithm, or in terms of a mathematical function (Bernstein Polynomials).

Every now and then though, you see people talking about Bezier curves being calculated via matrices. If you ever wondered what that was all about, this post should hopefully explain and demystify that a bit.

If you don’t know how to come up with the equation of a Bezier curve for any number of control points, you should give this a read first:
Easy Binomial Expansion & Bezier Curve Formulas

And if you are curious about the De Casteljau algorithm, you can learn about that here:
The De Casteljau Algorithm for Evaluating Bezier Curves

Ok, all read up on that stuff? Let’s get talking about Bezier curves in matrix form! There are shadertoy links at the end with working wegl glsl demos that include source code.

Making the Matrix Form of Bezier Curves

Coming up with the matrix for a Bezier curve is surprisingly easy. Keep in mind the matrix we are making is for glsl which is a column major matrix order, so you might have to adjust things if you are using a row major matrix order setup (mostly, just transpose the matrix).

The first step is to get the formula for a Bezier curve. We’ll work through the example using a quadratic Bezier curve with 3 control points A,B,C, so we start with the formula below:

f(t) = A*(1-t)^2 + B*2t(1-t) + C*t^2

The next step is to break the equation into one equation per term. Each term has a control point, so we are basically splitting the formula up so that we have one formula per control point.

A*(1-t)^2 \\  B*2t(1-t) \\  C*t^2

Next, we remove the control points and expand each term to get:

1-2t+t^2 \\  2t-2t^2 \\  t^2

Now, explicitly values of all powers of t that are present:
1*t^0-2*t^1+1*t^2 \\  0*t^0+2*t^1-2*t^2 \\  0*t^0+0*t^1+1*t^2

Now the final step. Take the constants that multiply your powers of t and make a matrix out of them. You are done!

\begin{bmatrix}  1 & -2 & 1 \\  0 & 2 & -2 \\  0 & 0 & 1 \\  \end{bmatrix}

Using the Matrix Form

Using the matrix form of Bezier curves is also pretty simple.

First, we need to make a vector of the power series of our t value:

powerSeries = \begin{bmatrix}  t^0 & t^1 & t^2 \\  \end{bmatrix}

Which can also be written as:

powerSeries = \begin{bmatrix}  1 & t & t^2 \\  \end{bmatrix}

You also need a vector of your control points:

controlPoints = \begin{bmatrix}  A & B & C \\  \end{bmatrix}

You next perform this operation to get a result vector:

result = powerSeries * curveMatrix * controlPoints

Then, you add up all components of result to get the value of the curve at time t.

value = result[0] + result[1] + result[2]

All done!

Note that this is a one dimensional Bezier curve. You need to do this operation once per axis to get your final multi dimensional Bezier curve point.

If you are confused by that last line, check out this post: One Dimensional Bezier Curves

Multiplying the Control Points In

You might notice that if you are evaluating several points on the same curve that you are going to be multiplying the curveMatrix matrix by the controlPoints vector over and over. You can multiply the control points into the Bezier curve matrix to make the specific matrix for those control points if you want to. You multiply the columns of the matrix by the control points, and adjust the result calculation like the below.

// Multiply the control points into the curve matrix
curveMatrix[0] *= A;
curveMatrix[1] *= B;
curveMatrix[2] *= C;

// Use the curve matrix that has the control points baked in, to do less math to get the result vector.
// You would calculate the curve matrix once and re-use it multiple times of course!
vec3 result = powerSeries * curveMatrix;
float value = result.x + result.y + result.z;


You might wonder when you’d use the matrix form. One time to use the matrix form would be when you had fast matrix math support (like on the GPU). Another time to use the matrix form though is if you ever want to cut up a Bezier curve into multiple smaller sub curves. The matrix form can help make that easier, and you can read more about that here if you want: A Matrix Formulation of the Cubic Bezier Curve

Here are some shadertoys that show this all working in webgl/glsl pixel shaders, along with source code:

Shadertoy: 1D Linear Bezier Matrix Form
Shadertoy: 1D Quadratic Bezier Matrix Form
Shadertoy: 1D Cubic Bezier Matrix Form



Actually Making Signed Distance Field Textures With JFA

This post is an addendum to the last post where I say that you can make distance field textures with JFA but don’t fully explain how to make SIGNED distance field textures, which is what you really want.

If you want to go straight to a working demo with webgl pixel shader source code, here is the shadertoy: Shadertoy: JFA SDF Texture

If you naively use a distance transform to make a distance field texture, you’ll get an UNSIGNED distance field texture, where you only have the distance to the surface of the object from the outside, but won’t have the distance to the surface of the object from the inside.

This is important because signed distance field textures have both, and use bilinear interpolation of distance on each side of the shape surface to make a nice smooth line. Below is what happens when you try to use an unsigned distance field texture (aka the distance transform of the image, using JFA / Voronoi information), using the zero distance line as the surface of the object:

It looks ok (if not fairly pixelated), but you can really see it break down when you zoom in:

So you might say to yourself, maybe i need to keep the surface line at distance 0.5 instead of 0.0 so that there is distance information to interpolate? If you do that, the first thing you might notice is that the objects get fatter:

But it does look better when you zoom in, which is a plus:

The real issue is that you really just need the distance from each pixel to the surface of the object from both the inside and the outside. In our case, our Voronoi diagram we make with JFA only gives the distance from the outside. So what is the solution? At first I was thinking maybe you can get the gradient of this data at the point of each pixel and “push the zero line in” a little bit to give at least one pixel layer worth of inside data. However, a brilliant friend of mine came up with the actual solution: You invert your source data so empty space becomes seed, and seed becomes empty space, and you run JFA again to get the distance from the inside!

That actually works very well. It’s also very easy to combine them. You make a pixel shader that reads the data from the outside Voronoi diagram and the inside Voronoi diagram, calculate the output distance (0.5 + outsideDistance * 0.5 – insideDistance * 0.5), and output that 0 to 1 distance value in one or more of the color channels.

Here’s a glsl excerpt below, note that we divide the distance by 8 and clamp between 0 and 1 so that the data is suitable for a normalized color image (normalized as in the color channels can store values between 0 and 1):

// calculate distances from seed coordinates
float outsideDist = clamp(length(outsideSeedCoord-fragCoord) / 8.0, 0.0, 1.0);
float insideDist  = clamp(length(insideSeedCoord-fFragCoord)  / 8.0, 0.0, 1.0);
// calculate output distance
float signedDistance = 0.5 + outsideDist * 0.5 - insideDist * 0.5;
// set the color based on that distance
fragColor = vec4(signedDistance);

It actually looks a lot like the first image where we use the zero distance line of the unsigned distance field texture, so we still aren’t quite there:

When you zoom in, it looks a little better, but something still seems a bit off:

The final step to making this look good is to realize that the power of signed distance textures is in their ability to interpolate distance information well. When we have a full resolution texture, there is no interpolation going on. We actually need to decrease the size of our distance field texture to make it look better. If only all problems were solved by making textures smaller!

Here is the resulting image when making the distance field texture 1/4 as large on each axis (1/16th as big total):

And zooming in you can see that it scales very well. The zoom is a 20x magnification, on top of the magnification we already get from it being a larger texture:

And just to show the intermediary textures, here is the outside distance Voronoi diagram:

And the inside distance Voronoi diagram (The seed is in bright green, the dim green is the empty space that has distance information):

And here is the final distance field texture used to render the final result I showed above.

Zoomed in to show just how low resolution it is! This is the thing that looks like a + or a sword just left of middle.

Again, here is the shadertoy that does this technique, generating a signed distance field texture on the fly for randomly placed objects, and then using that signed distance field to render a larger image that you can further zoom in to:
Shadertoy: JFA SDF Texture



Fast Voronoi Diagrams and Distance Field Textures on the GPU With the Jump Flooding Algorithm

The image below is called a Voronoi diagram. The circles show you the location of the seeds and the color of each pixel represents the closest seed to that pixel. The value of this diagram is that at any point on the image, you can know which seed (point) is the closest to that pixel. It’s basically a pre-computed “closest object” map, which you can imagine some uses for I bet.

Voronoi diagrams aren’t limited to just points as seeds though, you can use any shape you want.

One way Voronoi diagrams can be useful is for path finding because they are dual to (the equivelant of) Delauny triangulation (More info at How to Use Voronoi Diagrams to Control AI).

Another way they can be useful is in generating procedural content, like in these shadertoys:
Shadertoy: Voronoi – rocks
Shadertoy: Voronoi noises

Another really cool usage case of Voronoi diagrams is in creating what is called the “Distance Transform”. The distance transform calculates and stores the distance from each pixel to the closest seed, whichever one that may be. Doing that, you may get an image like this (note that the distance is clamped at a maximum and mapped to the 0 to 1 range to make this image).

That is what is called a distance texture and can be used for a very cool technique where you have texture based images that can be zoomed into / enlarged quite a ways before breaking down and looking bad. The mustache in the image below was made with a 32×16 single channel 8bpp image for instance! (ignore the white fog, this was a screenshot from inside a project I was working on)

Distance field textures are the next best thing to vector graphics (great for scalable fonts in games, as well as decals!) and you can read about it here: Distance Field Textures

Now that you know what Voronoi diagrams and distance transforms are used for, let’s talk about one way to generate them.

Jump Flooding Algorithm (JFA)

When you want to generate either a Voronoi diagram or a distance transform, there are algorithms which can get you the exact answer, and then there are algorithms which can get you an approximate answer and generally run a lot faster than the exact version.

The jump flooding algorithm is an algorithm to get you an approximate answer, but seems to have very little error in practice. It also runs very quickly on the GPU. It runs in constant time based on the number of seeds, because it’s execution time is based on the size of the output texture – and is based on that logarithmically.

Just a real quick note before going into JFA. If all else fails, you could use brute force to calculate both Voronoi diagrams as well as distance transforms. You would just loop through each pixel, and then for each pixel, you would loop through each seed, calculate the distance from the pixel to the seed, and just store the information for whatever seed was closest. Yes, you even do this for seed pixels. The result will be a distance of 0 to the seed 😛

Back to JFA, JFA is pretty simple to program, but understanding why it works may take a little bit of thinking.

Firstly you need to prepare the N x N texture that you want to run JFA on. It doesn’t need to be a square image but let’s make it square for the explanation. Initialize the texture with some sentinel value that means “No Data” such as (0.0, 0.0, 0.0, 0.0). You then put your seed data in. Each pixel that is a seed pixel needs to encode it’s coordinate inside of the pixel. For instance you may store the fragment coordinate bitpacked into the color channels if you have 32 bit pixels (x coordinate in r,g and y coordinate in b,a). Your texture is now initialized and ready for JFA!

JFA consists of taking log2(N) steps where each step is a full screen pixel shader pass.

In the pixel shader, you read samples from the texture in a 3×3 pattern where the middle pixel is the current fragment. The offset between each sample on each axis is 2^(log2(N) – passIndex – 1), where passIndex starts at zero.

That might be a bit hard to read so let’s work through an example.

Let’s say that you have an 8×8 texture (again, it doesn’t need to be square, or even power of 2 dimensions, but it makes it easier to explain), that has a 16 bit float per RGBA color channel. You fill the texture with (0.0, 0.0, 0.0, 0.0) meaning there is no data. Let’s say that you then fill in a few seed pixels, where the R,G channels contain the fragment coordinate of that pixel. Now it’s time to do JFA steps.

The first JFA step will be that for each pixel you read that pixel, as well as the rest of a 3×3 grid with offset 4. In total you’d read the offsets:
(-4.0, -4.0), (0.0, -4.0), (4.0, -4.0),
(-4.0, 0.0), (0.0, 0.0), (4.0, 0.0),
(-4.0, 4.0), (0.0, 4.0), (4.0, 4.0)

For each texture read you did, you calculate the distance from the seed it lists inside it (if the seed exists aka, the coordinate is not 0,0), and store the location of the closest seed int the output pixel (like, store the x,y of the seed in the r,g components of the pixel).

You then do another JFA step with offset 2, and then another JFA step with offset 1.

JFA is now done and your image will store the Voronoi diagram, that’s all! If you look at the raw texture it won’t look like anything though, so to view the Voronoi diagram, you need to make a pixel shader where it reads in the pixel value, deocdes it to get the x,y of the closest seed, and then uses that x,y somehow to generate a color (use it as a seed for a prng for instance). That color is what you would output in the pixel shader, to view the colorful Voronoi diagram.

To convert the Voronoi diagram to a distance transform, you’d do another full screen shader pass where for each pixel you’d calculate the distance from that pixel to the seed that it stores (the closest seed location) and write the distance as output. If you have a normalized texture format, you’re going to have to divide it by some constant and clamp it between 0 and 1, instead of storing the raw distance value. You now have a distance texture!

More Resources

I originally came across this algorithm on shadertoy: Shadertoy: Jump Flooding. That shadertoy was made by @paniq who is working on some pretty interesting stuff, that you can check out both on shadertoy and twitter.

The technique itself comes from this paper, which is a good read: Jump Flooding in GPU with Applications to Voronoi Diagram and Distance Transform


While JFA as explained is a 2D algorithm, it could be used on volume textures, or higher dimensions as well. Higher dimensions mean more texture reads, but it will still work. You could render volume textures with raymarching, using the distance information as a hint for how far you could march the ray each step.

Also, I’ve played around with doing 5 reads instead of 9, doing a plus sign read instead of a 3×3 grid. In my tests it worked just as well as regular JFA, but I’m sure in more complex situations there are probably at least minor differences. Check the links section for a shadertoy implementation of this. I also tried doing a complete x axis JFA followed by a y axis JFA. That turned out to have a LOT of errors.

You can also weight the seeds of the Voronoi diagram. When you are calculating the distance from a pixel to a seed point, you use the seed weight to multiply and/or add to the distance calculated. You can imagine that perhaps not all seeds are created equal (maybe some AI should avoid something more than something else), so weighting can be used to achieve this.


Here are some shadertoys I made to experiment with different aspects of this stuff. You can go check them out to see working examples of JFA in action with accompanying glsl source code.

Jump Flood Algorithm: Points – Point Seeds
Jump Flood Algorithm: Shapes – Shape Seeds
Jump Flood Algorithm: Weight Pts – Multiplicative Weighting
Jump Flood Algorithm: Add Wt Pts – Additive Weighting
Separable Axis JFA Testing – Does 5 reads instead of 9, also shows how separating axis completely fails.

Here is a really interesting shadertoy that shows a way of modifying a Vornoi diagram on the fly: Shadertoy: zoomable, stored voronoi cells



GPU Texture Sampler Bezier Curve Evaluation

Below is a paper I submitted to that unfortunately did not get accepted. Maybe next time!

The main idea of this paper is that bilinear interpolation can be equivalent to the De Casteljau algorithm, which means that if you set up a texture in a specific way, and sample from it at specific texture coordinates, that it will in fact give you Bezier curve points as output! It scales up for higher dimensional textures, as well as higher order curves.

The image below shows this in action for a cubic Bezier curve (3 control points) being stored and recalled from a 2×2 texture (there is actually a curve stored in each color channel).

This image is from an extension linked to lower down which applies the technique to surfaces and volumes:

The primary feedback from the reviewers and editor was that:

  • It was an interesting technique and they thought it was a paper worth reading.
  • The usage case was fairly limited though – basically only when your are compute bound in your shader program, and have some curve calculations to offload to the texture sampler. Or if you are already using a lookup texture and would benefit from fewer instructions and smaller lookup textures.
  • It could have been shorter due to the writing being shorter, but also it could have been less thorough. For instance, it didn’t need to show equivalence to both the De Casteljau’s algorithm as well as Bernstein polynomials, since it’s already known that those are equivalent.
  • They wanted some more performance details

I agree with the feedback, and don’t feel like taking the time to change and resubmit or submit else where, so I’m sharing it here on my blog. I hope you enjoy it and find it interesting (:

Here is the paper:

Here is the supplemental materials (opengl and webgl source code):

Here is the webgl demo from the supplemental materials, hosted on my site:
GPU Efficient Texture Based Bezier Curve Evaluation

Here are some working shadertoy demos of the technique:
Shadertoy: Mystery Curves – Quadratic
Shadertoy: Mystery Curves – Cubic
Shadertoy: Mystery Curves – Quartic
Shadertoy: Mystery Curves – Quintic


Continuations of this work:

Failed Experiments

Continuations that didn’t work out:

What are your thoughts? Is this cool? Is it lame? Got some ideas to improve it? Leave a comment! (:



G-Buffer Upsizing

The other day I had a thought:

Rendering smaller than full screen images is super helpful for performance, but upsizing an image results in pretty bad quality vs a full resolution render.

What if instead of upsizing the final rendered image, we upsized the values that were used to shade each pixel?

In other words, what if we rendered a scene from a less than full resolution g-buffer?

I was thinking that could be useful in doing ray based graphics, not having to trace or march quite so many rays, but also perhaps it could be useful for things like reflections where a user isn’t likely to notice a drop in resolution.

I’m sure I’m not the first to think of this, but I figured I’d explore it anyways and see what I could see.

I made an interactive shadertoy demo to explore this if you want to see it first hand:
Shadertoy: G-Buffer Upsizing


In short, it does look better in a lot of ways because the normals, uv coordinates and similar parameters interpolate really well, but the edges of shapes are aliased really bad (jaggies).

Check out the images below to see what i mean. The first image is a full sized render. The second image is a 1/4 sized render (half x and half y resolution). The third image is a 1/16th sized render (quarter x and quarter y resolution)

For comparison, here’s a 1/4 sized and 1/16 sized render upsized using bicubic IMAGE interpolation instead of g-buffer data interpolation:

Details & More Information

Despite the aliased results at 1/16th render size, this seems like it may be a reasonable technique at larger render sizes, depending on the level of quality you need. Doing half vertical or half horizontal resolution looks very close to the full sized image for instance. The edges are a tiny bit more aliased along one direction, but otherwise things seem decent:

Since the g-buffer has only limited space, you will probably want to bit pack multiple fields together in the same color channels. When you do that, you throw out the possibility of doing hardware interpolation unfortunately, because it interpolates the resulting bit packed value, not the individual fields that you packed in.

Even when doing the interpolation yourself in the pixel shader, for the most part you can really only store information that interpolates well. For instance, you could store a diffuse R,G,B color, but you wouldn’t want to store and then interpolate a material index. This is because you might have material index 10 (say it’s blue) next to material index 0 (say it’s green), and then when you interpolate you could end up with material index 5 which may be red. You’d get red between your blue and green which is very obviously wrong.

In my demo I did have a material index per pixel, but i just used nearest neighbor for that particular value always. To help the appearance of aliasing, I also stored an RGB diffuse color that i interpolated.

I stored the uvs in the g-buffer and sampled the textures themselves in the final shader though, to make sure and get the best texture information I could. This makes textures look great at virtually any resolution and is a lot of the reason why the result looks as good as it does IMO.

The fact that normals interpolate is a good thing, except when it comes to hard edges like the edge of the cube, or at the edge of any object really. In the case of the cube edge, it smooths the edge a little bit, making a surface that catches specular lighting and so highlights itself as being incorrect (!!). In the case of the edge of regular objects, a similar thing happens because it will interpolate between the normal at the edge of the object and the background, making a halo around the object which again catches specular lighting and highlights itself as being incorrect.

I think it could be interesting or fruitful to explore using edge detection to decide when to blend or not, to help the problem with normals, or maybe even just some edge detection based anti aliasing could be nice to make the resulting images better. The depth (z buffer value) could also maybe be used to help decide when to interpolate or not, to help the problem of halos around every object.

Interestingly, bicubic interpolation actually seems to enhance the problem areas compared to bilinear. It actually seems to highlight areas of change, where you would actually want it to sort of not point out the problems hehe. I think this is due to Runge’s phenomenon. Check out the depth information below to see what i mean. The first is bilinear, the second is bicubic:

One final side benefit of this I wanted to mention, is that if you are doing ray based rendering, where finding the geometry information per pixel can be time consuming, you could actually create your g-buffer once and re-shade it with different animated texture or lighting parameters, to give you a constant time (and very quick) render of any scene of any complexity, so long as the camera wasn’t moving, and there were no geometry changes happening. This is kind of along the same lines as the very first post I made to this blog about 4 years ago, which caches geometry in screen space tiles, allowing dirty rectangles to be used (MoriRT: Pixel and Geometry Caching to Aid Real Time Raytracing).

Anyone else go down this path and have some advice, or have any ideas on other things not mentioned? (:

Next up I think I want to look at temporal interpolation of g-buffers, to see what sort of characteristics that might have. (Quick update, the naive implementation of that is basically useless as far as i can tell: G-Buffer Temporal Interpolation).

Related Stuff

On shadertoy, casty mentioned that if you have some full res information, and some less than full res information, you can actually do something called “Joint Bilateral Upsampling” to get a better result.

Give this paper a read to learn more!
Joint Bilateral Upsampling

It turns out someone has already solved this challenge with great success. They use “the MSAA trick” to get more samples at the edges. Check out ~page 38:
GPU-Driven Rendering Pipelines