Failed Shadertoy “Dust” Game – Browsers Need To Compile Shaders Off the Main Thread!

I was working on a Dust Game on shadertoy for a few weeks and got a decent amount of the way through it. Unfortunately I hit a snag.

There’s a sad fact here: browsers compile WebGL shaders on the main thread. If WebGL blocks the main thread for too long, the browser considers WebGL hung and reports to the user that WebGL crashed.

The shaders for my game got complex enough that they were hitting the timeout while compiling, even though I wasn’t even close to being finished. So, I simplified a bit and published it unfinished, thinking that perhaps when WebGL 2.0 comes out this summer, I can revisit it.

Unfortunately, but totally understandably, people in the wild have all sorts of different machines, and some of those machines take longer to compile shaders than mine does, which means plenty of people were still “crashing”. Because of this, the shadertoy people moved my shader back to draft/private until I simplified it more, since crashes are no good, even if they aren’t really real crashes in the normal sense. You can still see it by direct link here if you are interested: Shadertoy: Dust Sim Sandbox (Incomplete)

I’ve spent enough time on this thing so I’m going to leave it private for now, but I really think this shows a fundamental problem with WebGL.

Since the shaders compile on the main thread (apparently in many browsers, if not all!), and browsers detect a crash/hang by timing out, writing WebGL code means you either have to keep your shaders super simple, or be ok with some amount of users experiencing “crashes” when shader compilation times out on their machines.

In other words, you have to choose between innovation and stability.

That doesn’t really bode well for a platform like WebGL that basically is all about innovation! I’m sure this same problem plagues WebVR, which means that if you want to make a WebVR experience that works well on all machines, you better not do anything very complex 😛

Despite this failure, I did succeed in making a minesweeper game. Check it out here: Shadertoy: Minesweeper

Anyways, onto the next shadertoy! I think I’m going to try to make one of those side view physics based car/motorcycle/bicycle hill climbing games.

Update!

It turns out my understanding of the shader compiling process was a bit simplified. Read the link below for the full scoop, but in the end the problem remains, because there doesn’t seem to be any way to query the status of the async work, so you have to guess when it’s done and hope you have enough other work to fill the time. In the case of shadertoy, there is no geometry to load or things of that nature, which makes it especially difficult to deal with.

Asynchronous Shader Compilation

Thanks to @tuan_kuranes for the better information (:

Making a Ray Traced Snake Game in Shadertoy

Shadertoy.com recently added multipass rendering, and one of the founders – iq – introduced the feature to us by (ab)using an off screen texture to store state from previous frames. This was interesting because up until that point, you had to render each frame using only the current time, and current input from the user – you had no knowledge whatsoever of the past. He ended up making a brick breaking game, which is pretty awesome. You can see it here: Shadertoy: Bricks Game.

I decided to follow that path too and make a snake game in shadertoy. This is a high level overview of how the game works and various considerations that came up while creating the game.

Play Shadertoy: Snake Game

Simulation vs. Presentation

When you start making a shadertoy shader, you have a single pixel shader program to work in. In the past, this was where you did all of your work to display the image to the user.

Now with multipass rendering you have the option of creating an off screen buffer (up to four of them actually, but we only need one for this discussion), which you write a pixel shader program for as well. This program is able to write to the off screen buffer by emitting pixels (like usual), but interestingly, it can also read from the same off screen buffer. Since it can both read and write the image, this allows it to preserve state across frames.

Since this pixel shader has state (unlike the main on screen pixel shader), this is where all your “simulation code” has to go. This is where you do all your game logic. Strangely, even though this buffer is a full screen buffer, you only need to process as many pixels as you need for storage of your game state. You can use the “discard” keyword for all other pixels, to get a quick “early out” in the pixel processing.

The on screen pixel shader is able to read this off screen buffer which stores game state, but it isn’t able to write to it. Because of this, the on screen pixel shader is essentially the visualization of the simulation.

Simulation Considerations

The simulation had quite a few considerations. Some of these were a little mind bending at first, but made sense after thinking about it for a while.

Perhaps the single strangest thing about the simulation pixel shader program is this: The pixel shader program can only write the value of the pixel that it represents. Let’s talk some details about what that means.

Reading and Writing Variables

In the snake game, there is a grid of 16×16 pixels, where each pixel stores the state of a game cell. Besides this, there are four other pixels that store various pieces of game state – like what direction the snake is moving in, whether the snake is dead or not, and the location of the apple.

One complication of this is that to spawn an apple, we have to set the apple’s location variable, but we also have to mark that an apple is in the specific grid cell where it should be. What we have to do is pick a random location for the apple to go to, and then say “if the pixel that this shader program is running for is the specific grid cell where the apple should spawn, then emit a pixel color that represents the data for an apple”.

That is kind of weird. Even weirder perhaps, you have to remember that this pixel shader program is being run for several pixels per frame (once per variable you have in your state!), and they all have to be in agreement about the random position of the apple so that only one apple gets written to the board, so your random numbers have to be deterministic, yet still seemingly random to the player.
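
To illustrate, here is the idea as a small C++ sketch (the real shader does the equivalent in GLSL; WangHash is just one well known integer hash, and the function names are mine, not from the actual shader). The important property is that the “random” apple position is a pure function of shared inputs like the frame number, so every shader instance computes the same position:

#include <cstdint>

// a well known integer mixing function (the Wang hash)
uint32_t WangHash(uint32_t x)
{
    x = (x ^ 61u) ^ (x >> 16);
    x *= 9u;
    x = x ^ (x >> 4);
    x *= 0x27d4eb2du;
    x = x ^ (x >> 15);
    return x;
}

// deterministic "random" apple cell: every shader instance calls this with the
// same frame number, so they all agree on where the apple goes
void RandomAppleCell(uint32_t frameNumber, uint32_t& cellX, uint32_t& cellY)
{
    uint32_t h = WangHash(frameNumber);
    cellX = h % 16;         // 16x16 game grid
    cellY = (h / 16) % 16;
}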

It’s odd to think that the entire game simulation is running per variable in your simulation, but that is what is happening.

There’s a trap that’s easy to fall in here too, where you might be tempted to write code like this: “If the pixel that we are running the shader program for is the new location that the snake wants to move to, and that grid cell contains a snake body part, then set the game over game state variable flag to true”

The problem there is that our if statement guarantees that we are running the pixel shader program for a grid cell, not for the game over game state variable, so when we change the game over state variable, it doesn’t get written out, since this pixel shader program instance isn’t running for that variable.

I actually wrote that exact code and it took me a little while to figure out why the game wouldn’t end when the snake crashed into itself, but it would end if it went out of bounds. The code looked very similar, and was very simple, and one worked, but the other didn’t.

The answer here is that instead of only running the logic for the grid cell that the snake head wants to enter, you have to make the logic run for all instances. So, unfortunately that means a texture read to read the grid cell that the snake head wants to enter into. That way, all instances have the information they need to run the same logic.

Technically, I guess it’s only the “game over” variable that needs to do the texture read, but it’s a lot easier and safer if you make all pixels run the same code deterministically. Also, having each pixel run the same code is likely a performance gain.

You might be able to squeeze some more performance out of being clever, but it’s real easy for that cleverness to bite you in the end too! In this case, the snake game runs well enough (60fps on my machine, even with the camera edge on, which is the worst case) so I’m calling it good.

Numerical Problems

One gotcha to be wary of: since you are storing game state variables in pixels, every variable is a vec4 of floats, and each component can only store floating point values between 0 and 1. So, if you are storing the position of a cell in a grid, you need to “normalize” it to be between 0 and 1.

Sadly, the floating point isn’t really guaranteed to be a full 32 bit floating point number either, and might be an 8 bit fixed point number or similar on some platforms. This makes it real easy to have numerical problems when you normalize your values into the 0 to 1 range.
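
For example, here is the round trip as a C++ sketch (the helper names are mine; the shader does the same thing in GLSL). Storing the center of each cell and truncating on decode keeps small precision errors from shifting a value into a neighboring cell:

const float c_gridSize = 16.0f; // the game uses a 16x16 grid

// encode a grid coordinate in [0, 15] as a [0, 1] value for storage in a pixel
float EncodeCell(int cell)
{
    return (float(cell) + 0.5f) / c_gridSize; // store the center of the cell
}

// decode back to a cell index. truncation is safe because the stored value sits
// at the center of its cell, half a cell away from either boundary.
int DecodeCell(float stored)
{
    return int(stored * c_gridSize);
}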

Random Apple Spawning

When an apple is eaten (or the game is starting) an apple needs to spawn in a random location that isn’t yet taken by the snake’s body. The naive way would be to generate random numbers in a loop until we find an empty spot. That could be a perf problem for the pixel shader, since it might do a large number of iterations, and it might not find a valid location before running out of its fixed number of loop iterations.

I went with a different solution. When there is no apple spawned, the simulation generates a random position to put the apple at, and if it’s empty, it puts the apple there. If the position already was taken by a snake body part, it leaves the apple unspawned and tries again next frame. With a suitably high frame rate (it runs at 60fps for me!) it ought to be able to find an empty spot pretty quickly, even if you have a really large snake.

Frame Rate Independent Gameplay

When making games on any platform, it’s very easy to have problems with objects moving faster on machines that have higher FPS. The typical way to solve this is to keep track of how much time has elapsed between frames, keeping a total as the frames progress, and when the total gets above a certain point, you do a game update and reset the total back to zero (or the remainder after subtracting your “tick time” out of the total).

This was luckily very easy to do in shadertoy as well. There is a value you are provided called “iTimeDelta” which is how long the last frame took to render, in seconds. I divide this number by the tick duration (1/12 of a second, since I want 12 ticks per second) so that a total of 1 means a full tick has elapsed, and then add it into a state variable (which is stored in a pixel). When doing the add, if the result is ever greater than 1, I do a tick and reset the total to 0.
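
In C++ terms, the accumulator logic looks something like this (a sketch with names of my own choosing; in the actual shader the running total lives in a state pixel and iTimeDelta is provided by shadertoy):

const float c_ticksPerSecond = 12.0f;

// called once per frame. total persists across frames (stored in a state pixel).
// returns true when a simulation tick should happen this frame.
bool UpdateTickTimer(float& total, float iTimeDelta)
{
    // dividing the frame time by the tick duration (1/12 second) means that a
    // total of 1 represents one full tick's worth of elapsed time
    total += iTimeDelta * c_ticksPerSecond;
    if (total > 1.0f)
    {
        total = 0.0f; // reset to 0 rather than keeping the remainder, as described above
        return true;
    }
    return false;
}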

Doing the tick just means moving the snake and handling whether the snake ate an apple, died, etc.

Low Latency Input

At first I had my input handling be handled inside of the tick. This was a problem because the tick happens 12 times a second, which means if you want the snake to change directions, you better have the key pressed down during one of those ticks. If you press it and then release it between ticks, the key press is lost and the snake doesn’t change direction and you die. This was extremely noticeable and made for really bad controls.

To solve this, I moved the key press handling to be OUTSIDE the tick, and had it be processed every frame. Instead of changing the snake’s direction right away though, I queued it up to be handled on the next tick. This way, you can quickly tap keys and they are registered as they should be on the next tick. The controls feel fine now and all is well.
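
Here is a sketch of that separation in C++ (names are mine; in the shader, the queued direction is just one more state variable stored in a pixel):

enum class EDirection { None, Up, Down, Left, Right };

// runs every frame, so a quick tap between ticks is never lost
void HandleInput(EDirection keyPressed, EDirection& queuedDirection)
{
    if (keyPressed != EDirection::None)
        queuedDirection = keyPressed; // remember it for the next tick
}

// runs only when a tick happens: consume whatever was queued since the last tick
void DoTick(EDirection& snakeDirection, EDirection& queuedDirection)
{
    if (queuedDirection != EDirection::None)
    {
        snakeDirection = queuedDirection;
        queuedDirection = EDirection::None;
    }
    // ... move the snake, check for apples, death, etc.
}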

Visualization & Rendering

The rendering of the snake game had fewer considerations than the simulation, but there are a couple things worth mentioning.

There are just a few parts to the rendering:

  1. If the ray hits the game board, make it look like the game board
  2. If the ray enters the small box of “play area” above the game board, it raytraces the contents of the game board
  3. If the ray misses everything, it does a lookup in a cube map texture to get the background
  4. If the ray hits anything, it uses some variables (normal, diffuse color, shininess) from the collision to do the shading. It also does a lookup in the cube map texture to do some environment mapping, to give the impression that reflection is happening, even though it isn’t

The second bullet item is the most interesting. What the code does is find the grid cells that the ray enters and leaves the play area, and then it walks from the start to the end, checking each cell to see if it has anything in it, and if so, doing a raytrace test against it.

At first I used the Bresenham line algorithm to walk the grid cells, but when there were holes in some of my spheres in strange places, I realized my problem. Bresenham isn’t meant to draw to (visit) every grid cell that a line passes through – it is only meant to draw to the most important ones!

So, I replaced Bresenham with a grid traversal algorithm and all was well.
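
By grid traversal I mean the standard DDA style walk (often credited to Amanatides & Woo) that visits every cell a ray passes through, in order. Here is a minimal 2D C++ sketch of the idea (the game does the 3D version, and the names are mine):

#include <cmath>
#include <cstdio>

// visit every grid cell along a ray, in order (unlike Bresenham, no cells are skipped)
void TraverseGrid(float startX, float startY, float dirX, float dirY, int numSteps)
{
    int cellX = int(std::floor(startX));
    int cellY = int(std::floor(startY));

    int stepX = (dirX >= 0.0f) ? 1 : -1;
    int stepY = (dirY >= 0.0f) ? 1 : -1;

    // t values where the ray crosses the next x or y cell boundary
    float nextBoundaryX = (dirX >= 0.0f) ? float(cellX + 1) : float(cellX);
    float nextBoundaryY = (dirY >= 0.0f) ? float(cellY + 1) : float(cellY);
    float tMaxX = (dirX != 0.0f) ? (nextBoundaryX - startX) / dirX : 1e30f;
    float tMaxY = (dirY != 0.0f) ? (nextBoundaryY - startY) / dirY : 1e30f;

    // how far along the ray we travel to cross one full cell in x or y
    float tDeltaX = (dirX != 0.0f) ? std::fabs(1.0f / dirX) : 1e30f;
    float tDeltaY = (dirY != 0.0f) ? std::fabs(1.0f / dirY) : 1e30f;

    for (int i = 0; i < numSteps; ++i)
    {
        printf("visiting cell (%i, %i)\n", cellX, cellY); // raytrace this cell's contents here

        // step into whichever neighboring cell the ray enters first
        if (tMaxX < tMaxY) { cellX += stepX; tMaxX += tDeltaX; }
        else               { cellY += stepY; tMaxY += tDeltaY; }
    }
}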

The game is rendered using real time raytracing, but it only does PRIMARY rays. It doesn’t do reflection, refraction or shadows. This is because doing any of those things means raytracing against the game board (grid cells) more than once. Since we have to do a texture lookup per grid cell we want to raytrace, this could equal a lot of texture lookups as you can imagine. There is a hard limit to the number you can do (I’m pretty sure, although I don’t know what it is, and it probably varies from machine to machine), so I wanted to try and not push things too hard, and just stayed with primary rays only.

That’s All!

That’s basically it. Multipass rendering is an awesome feature in shadertoy. On one hand, getting it makes me a bit sad, because people no longer have to work within such tight constraints when doing the amazing things they do in shadertoy. On the other hand, I think multipass rendering really ought to empower people to do much more amazing things than before – including but not limited to games.

I really look forward to seeing what other people do with it, and learning how to leverage multipass rendering in interesting ways.

I also look forward to contributing my own game and non game multipass shaders. I have some ideas for both, keep an eye out (:

O(1) Data Lookups With Minimal Perfect Hashing

Hash tables are great in that you can hash a key and then use that hash as an index into an array to get the information associated with that key. That is very fast, so long as you use a fast hash function.

The story doesn’t end there though because hash functions can have collisions – multiple things can hash to the same value even though they are different. This means that in practice you have to do things like store a list of objects per hash value, and when doing a lookup, you need to do a slower full comparison against each item in the bucket to look for a match, instead of being able to only rely on the hash value comparisons.

What if our hash function didn’t have collisions though? What if we could take in N items, hash them, and get 0 to N-1 as output, with no collisions? This post will talk about how to make that very thing happen, with simple sample C++ code as well, believe it or not!

Minimal Perfect Hashing

Perfect hashing is a hash function which has no collisions. You can hash N items, and you get out N different hash values with no collisions. The number of items being hashed has to be smaller than or equal to the possible values your hash can give as output though. For instance if you have a 4 bit hash which gives 16 possible different hash values, and you hash 17 different items, you are going to have at least one collision, it’s unavoidable. But, if you hash 16 items or fewer, it’s possible in that situation that you could have a hash which had no collisions for the items you hashed.

Minimal perfect hashing takes this a step further. Not only are there no collisions, but when you hash N items, you get 0 to N-1 as output.

You might imagine that this is possible – because you could craft limitless numbers of hashing algorithms, and could pass any different salt value to it to change the output values it gives for inputs – but finding the specific hashing algorithm, and the specific salt value(s) to use sounds like a super hard brute force operation.

The method we are going to talk about today is indeed brute force, but it cleverly breaks the problem apart into smaller, easier to solve problems, which is pretty awesome. It’s actually a pretty simple algorithm too.

Here is how you create a minimal perfect hash table:

  1. Hash the items into buckets – there will be collisions at this point.
  2. Sort the buckets from most items to fewest items.
  3. Find a salt value for each bucket such that when all items in that bucket are hashed, they claim only unclaimed slots. Store this array of salt values for later; it will be needed when doing a data lookup.
  4. If a bucket only has one item, instead of searching for a salt value, you can just find an unclaimed index and store -(index+1) into the salt array.

Once you have your minimal perfect hash calculated, here is how you do a lookup:

  1. Hash the key, and use that hash to find what salt to use.
  2. If the salt is positive (or zero), hash the key again using the salt. The result of that hash will be an index into the data table.
  3. If the salt is negative, take the absolute value and subtract one to get the index in the data table.
  4. Since it’s possible the key being looked for isn’t in the table, compare the key with the key stored at that index in the table. If they match, return the data at that index. If they don’t match, the key was not in the table.

Pretty simple isn’t it?
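
As a usage sketch, here is what a round trip looks like with the C++ class from the example code section below (Calculate and GetValue are defined there; the keys and values here are made up):

CPerfectHashTable<int> table;
CPerfectHashTable<int>::TKeyValueVector data;
data.push_back(std::make_pair(std::string("apple"), 1));
data.push_back(std::make_pair(std::string("banana"), 2));
table.Calculate(data); // slow: brute force searches for salt values per bucket

int* value = table.GetValue("banana"); // fast: two hashes and one string compare
// value != nullptr and *value == 2

int* missing = table.GetValue("cherry");
// missing == nullptr, since "cherry" was never added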

Algorithm Characteristics

This perfect minimal hash algorithm is set up to be slow to generate, but fast to query. This makes it great for situations where the keys are static and unchanging, but you want fast lookup times – like for instance loading a data file for use in a game.

Interestingly though, while you can’t add new keys, you could make it so you could delete keys. You would just remove the key / value pair from the data set, and then when doing a lookup you’d find an empty slot.

Also, there is nothing about this algorithm that says you can’t modify the data associated with the keys, at runtime. Modifying the data associated with a key doesn’t affect where the key/value pair is stored in the table, so you can modify the data all you want.

If you wanted to be able to visit the items in a sorted order, you could add a constraint to the salt search: not only must all the items in a bucket map to unclaimed slots, they must map to the specific slots that keep the items in sorted order. That would increase the time it took to generate the table, and increase the chances that no salt value has a valid solution, but it is possible if you want to be able to visit the items in some sorted order.

Interestingly, the generation time of the minimal perfect hash apparently grows linearly with the number of items it acts on. That makes it scale well. On my own computer for instance, I am able to generate the table for 100,000 items in about 4.5 seconds.

Also, in my implementation, if you have N items, the number of salt values is the largest power of two that doesn’t exceed N. You could decrease the number of salt values if you wanted to use less memory, but that would again come at the cost of increased time to generate the table, as well as an increased chance that there is no valid solution for any salt value.

Example Code

Below is a simple implementation of the algorithm described.

The main point of the code (besides functionality) is readability so it isn’t optimized as well as it could be, but still runs very fast (100,000 items processed in about 4.5 seconds on my machine). Debug is quite a bit slower than release for me though – I gave up on those same 100,000 items after a few minutes running in debug.

The code below uses MurmurHash2, but you could drop in whatever hashing algorithm you wanted.

The data file for this code is words.txt and comes to us courtesy of English Wordlists.

#include <vector>
#include <algorithm>
#include <assert.h>
#include <fstream>
#include <string>
#include <stdio.h>   // for printf
#include <string.h>  // for strlen

unsigned int MurmurHash2(const void * key, int len, unsigned int seed);

//=================================================================================
template <typename VALUE>
class CPerfectHashTable {
public:

    typedef std::pair<std::string, VALUE> TKeyValuePair;
    typedef std::vector<TKeyValuePair> TKeyValueVector;
    struct SKeyValueVectorBucket {
        TKeyValueVector m_vector;
        size_t          m_bucketIndex;
    };
    typedef std::vector<struct SKeyValueVectorBucket> TKeyValueVectorBuckets;

    // Create the perfect hash table for the specified data items
    void Calculate (const TKeyValueVector& data) {

        // ----- Step 1: hash each data item and collect them into buckets.
        m_numItems = data.size();
        m_numBuckets = NumBucketsForSize(m_numItems);
        m_bucketMask = m_numBuckets - 1;
        m_salts.resize(m_numBuckets);
        m_data.resize(m_numItems);
        TKeyValueVectorBuckets buckets;
        buckets.resize(m_numBuckets);

        for (size_t i = 0; i < m_numBuckets; ++i)
            buckets[i].m_bucketIndex = i;

        for (const TKeyValuePair& pair : data) {
            size_t bucket = FirstHash(pair.first.c_str());
            buckets[bucket].m_vector.push_back(pair);
        }

        // ----- Step 2: sort buckets from most items to fewest items
        std::sort(
            buckets.begin(),
            buckets.end(),
            [](const SKeyValueVectorBucket& a, const SKeyValueVectorBucket& b) {
                return a.m_vector.size() > b.m_vector.size();
            }
        );

        // ----- Step 3: find salt values for each bucket such that when all items
        // are hashed with their bucket's salt value, that there are no collisions.
        // Note that we can stop when we hit a zero sized bucket since they are sorted
        // by length descending.
        std::vector<bool> slotsClaimed;
        slotsClaimed.resize(m_numItems);
        for (size_t bucketIndex = 0, bucketCount = buckets.size(); bucketIndex < bucketCount; ++bucketIndex) {
            if (buckets[bucketIndex].m_vector.size() == 0)
                break;
            FindSaltForBucket(buckets[bucketIndex], slotsClaimed);
        }
    }

    // Look up a value by key.  Get a pointer back.  null means not found.
    // You can modify data if you want, but you can't add/remove/modify keys without recalculating.
    VALUE* GetValue (const char* key) {

        // do the first hash lookup and get the salt to use
        size_t bucket = FirstHash(key);
        int salt = m_salts[bucket];

        // if the salt is negative, its absolute value minus 1 is the index to use.
        size_t dataIndex;
        if (salt < 0)
            dataIndex = (size_t)((salt * -1) - 1);
        // else do the second hash lookup to get where the key value pair should be
        else
            dataIndex = MurmurHash2(key, strlen(key), (unsigned int)salt) % m_data.size();

        // if the keys match, we found it, else it doesn't exist in the table
        if (m_data[dataIndex].first.compare(key) == 0)
            return &m_data[dataIndex].second;
        return nullptr;
    }

private:

    unsigned int FirstHash (const char* key) {
        return MurmurHash2(key, strlen(key), 435) & m_bucketMask;
    }

    void FindSaltForBucket (const SKeyValueVectorBucket& bucket, std::vector<bool>& slotsClaimed) {

        // if the bucket size is 1, instead of looking for a salt, just encode the index to use in the salt.
        // store it as (index+1)*-1 so that we can use index 0 too.
        if (bucket.m_vector.size() == 1) {
            for (size_t i = 0, c = slotsClaimed.size(); i < c; ++i)
            {
                if (!slotsClaimed[i])
                {
                    slotsClaimed[i] = true;
                    m_salts[bucket.m_bucketIndex] = -((int)i + 1);
                    m_data[i] = bucket.m_vector[0];
                    return;
                }
            }
            // we shouldn't ever get here
            assert(false);
        }

        // find the salt value for the items in this bucket that cause these items to take
        // only unclaimed slots
        for (int salt = 0; ; ++salt) {
            // if salt ever overflows to a negative number, that's a problem.
            assert(salt >= 0);
            std::vector<size_t> slotsThisBucket;
            bool success = std::all_of(
                bucket.m_vector.begin(),
                bucket.m_vector.end(),
                [this, &slotsThisBucket, salt, &slotsClaimed](const TKeyValuePair& keyValuePair) -> bool {
                    const char* key = keyValuePair.first.c_str();
                    unsigned int slotWanted = MurmurHash2(key, strlen(key), (unsigned int)salt) % m_numItems;
                    if (slotsClaimed[slotWanted])
                        return false;
                    if (std::find(slotsThisBucket.begin(), slotsThisBucket.end(), slotWanted) != slotsThisBucket.end())
                        return false;
                    slotsThisBucket.push_back(slotWanted);
                    return true;
                }
            );

            // When we find a salt value that fits our constraints, remember the salt
            // value and also claim all the slots.
            if (success)
            {
                m_salts[bucket.m_bucketIndex] = salt;
                for (size_t i = 0, c = bucket.m_vector.size(); i < c; ++i)
                {
                    m_data[slotsThisBucket[i]] = bucket.m_vector[i];
                    slotsClaimed[slotsThisBucket[i]] = true;
                }
                return;
            }
        }
    }

    static size_t NumBucketsForSize (size_t size) {
        // returns how many buckets should be used for a specific number of elements.
        // Uses the largest power of two that is <= the size, or 1, whichever is bigger.
        if (!size)
            return 1;

        size_t ret = 1;
        size = size >> 1;
        while (size) {
            ret = ret << 1;
            size = size >> 1;
        }
        return ret;
    }

    // When doing a lookup, a first hash is done to find what salt to use
    // for the second hash.  This table stores the salt values for that second
    // hash.
    std::vector<int> m_salts;

    // NOTE: this stores both the key and the value.  This is to handle the
    // situation where a key is searched for which doesn't exist in the table.
    // That key will still hash to some valid index, so we need to detect that
    // it isn't the right key for that index.  If you are never going to look for
    // nonexistent keys, then you can "throw away the keys" and only store the
    // values.  That can be a nice memory savings.
    std::vector<TKeyValuePair> m_data;

    // useful values
    size_t m_numItems;
    size_t m_numBuckets;
    size_t m_bucketMask;
};

// MurmurHash code was taken from https://sites.google.com/site/murmurhash/
//=================================================================================
// MurmurHash2, by Austin Appleby
 
// Note - This code makes a few assumptions about how your machine behaves -
 
// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4
 
// And it has a few limitations -
 
// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.
 
unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
    // 'm' and 'r' are mixing constants generated offline.
    // They're not really 'magic', they just happen to work well.
 
    const unsigned int m = 0x5bd1e995;
    const int r = 24;
 
    // Initialize the hash to a 'random' value
 
    unsigned int h = seed ^ len;
 
    // Mix 4 bytes at a time into the hash
 
    const unsigned char * data = (const unsigned char *)key;
 
    while(len >= 4)
    {
        unsigned int k = *(unsigned int *)data;
 
        k *= m; 
        k ^= k >> r; 
        k *= m; 
         
        h *= m; 
        h ^= k;
 
        data += 4;
        len -= 4;
    }
     
    // Handle the last few bytes of the input array
 
    switch(len)
    {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0];
            h *= m;
    };
 
    // Do a few final mixes of the hash to ensure the last few
    // bytes are well-incorporated.
 
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
 
    return h;
}


//=================================================================================
void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}

//=================================================================================
int main (int argc, char **argv)
{
    // Read the data entries from a file.  Use the line as the key, and the line
    // number as the data.  Limit it to 100,000 entries.
    printf("Loading Data...");
    CPerfectHashTable<int> table;
    decltype(table)::TKeyValueVector data;
    std::ifstream file("words.txt");
    std::string str;
    int i = 0;
    while (std::getline(file, str) && i < 100000)
    {
        data.push_back(std::make_pair(str, i));
        ++i;
    }
    printf("Donenn");

    // make the perfect hash table
    printf("Generating Minimal Perfect Hash Table...");
    table.Calculate(data);
    printf("Donenn");

    // Verify results
    printf("Verifying results...");
    for (decltype(table)::TKeyValuePair& keyValuePair : data) {
        int* value = table.GetValue(keyValuePair.first.c_str());
        assert(value != nullptr);
        if (value == nullptr)
        {
            printf("Error, could not find data for key "%s"n", keyValuePair.first.c_str());
            WaitForEnter();
            return 0;
        }
        assert(*value == keyValuePair.second);
        if (*value != keyValuePair.second)
        {
            printf("  table["%s"] = %in", keyValuePair.first.c_str(), *value);
            printf("Incorrect value detected, should have gotten %i!n", keyValuePair.second);
            WaitForEnter();
            return 0;
        }
    }
    printf("Donenn");

    WaitForEnter();
    return 0;
}

Links & More

I learned the details of this algorithm from this page: Steve Hanov’s Blog: Throw away the keys: Easy, Minimal Perfect Hashing, after hearing about the technique mentioned occasionally by colleagues.

There are other ways to do minimal perfect hashing however. For instance give this a read: Minimal Perfect Hashing

One place that method is better than this one is that with this one, you have to hash the key twice when doing a lookup. In the technique described at that link, you only have to hash the key once, and use that hash to combine the results of two lookups. The two lookups are not dependent, so they can be re-ordered or happen concurrently, which makes it faster on modern CPUs.

Apparently there are also some techniques for generating the minimal perfect hash on a large number of items by breaking them apart into smaller sets, which can then be parallelized across threads.

I also wanted to mention that a large part of the in memory representation of this data structure can come from storing the keys along with the data, to verify that you have indeed found the item you are looking for after doing the O(1) lookup. If you are in a situation where you are only ever going to be searching for keys that you know are in the table, you can actually forgo storing the keys, to bring the memory requirements down a significant amount.

Also, the example code implements this as a hash table, but you could also use this as a set, if you wanted fast membership tests. You could either store the keys, to be able to resolve when things were not part of the set, or you could make a hash table of string to bool, where the bool specifies whether something is in the set. Which one is better depends on your specific needs and whether you plan to search for unknown keys or not.
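
Here is a quick sketch of the first option (storing keys, so searching for unknown keys resolves to “not in the set”), using the class from the example code above:

CPerfectHashTable<bool> wordSet;
CPerfectHashTable<bool>::TKeyValueVector members;
members.push_back(std::make_pair(std::string("apple"), true));
members.push_back(std::make_pair(std::string("pear"), true));
wordSet.Calculate(members);

// membership test: a non-null result means the key is in the set
bool hasApple = (wordSet.GetValue("apple") != nullptr); // true
bool hasGrape = (wordSet.GetValue("grape") != nullptr); // false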

Lastly, as a byproduct of using a data structure like this, you can get a unique ID per key, which is the index that it appears in the data array. This can be super helpful if you have something like a list of filenames, where you would rather work with integers instead of specific file names, since integers are quicker and easier to compare, copy around, etc.

You could even make this data structure support lookups by either key or by unique id. This way, you could do a by key lookup the first time you needed to find something, and then could store off the ID to get it by ID from then on for a faster lookup. Heck, you could even do all your lookups “offline” (at tool time, not when the app is running) and for instance convert all file references in data files to be each file’s unique ID instead. Then, you could toss out the keys of your data structure, only storing an array of data, and using the unique file ID to look into that table whenever you wanted to read or write meta data associated with that file. That would make lookups faster, and also decrease the memory requirements of memory resident data files.

It’s pretty cool stuff, and a pretty useful algorithm. Hopefully you find it useful as well (:

Update – 12/22/15

Interestingly, this post has had something like 15,000 readers since it was posted. That is by far the most read post on this blog 😛

Anyways, I wanted to add some more info I’ve found recently.

There are tools for doing minimal perfect hashing that are very likely to give you better results than the algorithm I describe above, gperf being a well known one.

Here’s a conversation talking about gperf and the alternative applications, and pros and cons for each:
Stack Overflow: Making a perfect hash (all consecutive buckets full), gperf or alternatives?

Here’s a research paper on gperf by Doug Schmidt: GPERF – A Perfect Hash Function Generator

I had a thought that maybe there was some play here by using “logical synthesis” to come up with some algorithm to map the inputs (the keys of the hash table) to the outputs (collision free output indices).

I started looking into Karnaugh maps and then the Quine–McCluskey algorithm, and then espresso and espresso-exact (mincov). Where the first two things are decent at solving multi bit input to single bit output, the second two things are decent at solving multi bit input to multi bit output, allowing operations to be shared among bits.

While I haven’t found anyone using those specific algorithms to solve the problem, people have tried, and definitely are still trying, to generate code without lookups. From what I’ve read so far, it sounds like such a function takes a lot longer to find, and also runs more slowly in practice, than a less perfect solution which has lookups.

Either way, this is still an active area of research, and plenty of folks are working on it so I’m going to leave it to them.

I also sort of have the feeling that if you are in need of minimal perfect hashing, you may be “doing it wrong”. For instance, if you are at all able to, you are probably better off having a pre-agreed set of unique IDs per “thing” you want to look up. These unique IDs can be used directly as array indices for the magical always O(1) lookup that minimal perfect hashing is going for, and that is actually a quicker lookup in practice since you don’t need to jump through special hoops to calculate the array index.

The only exceptions I can think of are:

  1. Real world requirements and not being able to reach the ideal situation – this is the boat I’m in right now. Ideally, systems would be asking about things by an integer ID, but I can’t get there in a single step, so the perfect hashing is a temporary bridge til I can get there.
  2. Even with IDs, sparse arrays are still problematic. You may have an ID per file that could be asked about, but say that you have 1,000,000 files, but you want to make an array of data for only 10,000 of them. How do you take the unique ID of a file and do a lookup into that smaller table? Minimal perfect hashing seems to be useful in this situation, but there may be other decent or comparable alternatives if you are already using integer IDs.

Improving the Security of the Super Simple Symmetric Leveled Homomorphic Encryption Implementation

The last post showed a super simple encryption algorithm that let an untrusted person perform calculations with encrypted data such that when they gave the results back to you, you could decrypt them and get the results of the calculation as if they were done on the unencrypted values. The best part was that this untrusted party had no knowledge of the data it was doing the calculations on.

While it was (hopefully) easy to understand, there were a lot of problems with its security. The biggest of which probably was the fact that the encryption of a false bit was the secret key itself!

This post is going to slightly increase the complexity of the encryption operation to match what the paper that I’m getting this stuff from says to do (Fully Homomorphic Encryption over the Integers).

All of the other operations – like XOR, AND, Decryption, use of constants – remain the same. It’s just the process of turning a plain text bit into a cipher text bit that is going to change.

Disclaimer: I am a game programmer, not a cryptographer, and you should REALLY do your own investigation and consult experts before actually rolling anything out to production or trusting that what I say is correct!

Improvement #1 – Multiply Key By Random Number

The scheme to encrypt a plaintext bit from the last blog post was this:

encryptedBit = key + (value ? 1 : 0);

A major problem with that scheme is that if you encrypt a false bit, you get the key itself.

Another problem that we touched on was that the parity of the plain text bit (whether it was odd or even, aka a 1 or a 0) was always opposite of the parity of the cipher text.

Yet another problem was that for the same key, 0 always encrypted to the same value, and 1 always encrypted to the same value, which was the “0” encrypted value, plus 1.

We are going to address all of these problems by adding a simple operation to encryption. We are just going to multiply the key by a random number before adding the plain text bit in. That gives us the following:

encryptedBit = key*randomNumber + (value ? 1 : 0);

Where randomNumber is at least 1.

This above helps in the following ways:

  • Encrypting false doesn’t always just give you the key anymore!
  • Since the cipherBit divided by the key is now a random number (and since the key is an odd number), it will be random whether the parity of the cipher text matches or mismatches the plain text. You can no longer use that information to figure out the value of the encrypted bit!
  • If you encrypt a false value, you will get different values each time. Same when encrypting a true value. When looking at two ciphertexts that have the same underlying plaintext value, you will no longer be able to tell that they are equal just by looking at them, or be able to tell that one is larger than the other so must be the true bit!

That is a pretty good improvement, but we can do a little better.

Improvement #2 – Add Random Noise

The second improvement we are going to do is add random noise to our encrypted value. This will make it so encrypting a false bit will not result in a multiple of the key, but will instead result in NEARLY a multiple of the key, which is a harder problem to figure out as an attacker.

You might ask how we are going to preserve our encrypted value if we are adding random noise into the result. Well, in actuality, all we really need to preserve is the lowest bit, so we are going to add an EVEN NUMBERED amount of noise. Decryption still works because taking the encrypted value modulo the key strips away the key*randomNumber1 term, leaving 2*randomNumber2 + value (so long as that noise stays smaller than the key), and then taking that modulo 2 discards the even noise, leaving just the plaintext bit.

That makes our encryption scheme become this:

encryptedBit = key*randomNumber1 + 2*randomNumber2 + (value ? 1 : 0);

While this increases our security, it also increases the noise (error) in our encrypted data, which makes it so we can do fewer operations before the error gets too large and we start getting wrong answers.

There we are, our encryption scheme now matches the one described in the paper. All the rest of our operations remain unchanged.

Security

The above looks good, but what range of values should we use for randomNumber1 and randomNumber2, and what should the actual size of the key be?

The paper refers to a security parameter lambda that everything else is based on.

It says that the size of the key (N) should be lambda^2 bits, that randomNumber1 should be around 2^(N^3) (in other words, N^3 bits), and that randomNumber2 should be around 2^(sqrt(N)) (in other words, sqrt(N) bits). For example, lambda = 80 gives a key of 6400 bits, which is 800 bytes.

It also says that the best known attack against this algorithm takes about 2^lambda operations and that you can assume an attacker is going to be able to do a billion operations per second (per this info here). Note that 1 billion operations per second is a typical computer, not a super computer or a distributed attack using a bot network!

Let’s look at some example values for lambda!

Lambda    Key Size    RN1 Size    RN2 Size    Attack Time
80        800B        30.5GB      10B         38 million years
60        450B        5.4GB       8B          36 years
40        200B        488MB       5B          17 minutes
20        50B         7.6MB       3B          < 1 second

Ouch! RandomNumber1 sure is huge, isn’t it? I’ve double and triple checked and that really does seem to be correct. Encrypting a single bit is essentially going to be as large as RandomNumber1. That is so unwieldy it’s ridiculous. I’m going to quadruple check that I think, because that is just insane…

BTW quick tangent. An interesting thing to note is that if your key is an even number, instead of an odd number, noise/error can grow as much as it wants, and will never give you the wrong result! That means that by using an even numbered key, this scheme is fully homomorphic. However, using an even key, encryptedBit % 2 == decryptedBit, so it’s super insecure.

I’ve been thinking about it quite a bit, but I can’t think of any avenues to use an even numbered key but still have any semblance of real security. If you can think of a way, go publish a paper about it and become famous! 😛

Example Code

Here is the sample code from the last post, updated with the new encryption routine. I had to decrease the number of bits in the addition tests, and I had to tone down the security quite a bit to make it fit within 64 bits.

#include <stdio.h>
#include <stdint.h>
#include <random>
#include <array>
#include <inttypes.h>

typedef uint64_t uint64;

// Increase this value to increase the size of the key, and also the maximum
// size of the error allowed.
// If you set it too high, operations will fail when they run out of storage space
// in the 64 bit ints.  If you set it too low, you will not be able to do very many
// operations in a row.
// The recomended values for good security for these numbers are way too large to
// fit in a uint64, so adjusting them down to show their effects while using uint64s
const size_t c_numKeyBits = 15;
const size_t c_numNoiseBits = 3; //size_t(sqrt(c_numKeyBits));
const size_t c_numMultiplierBits = 4; //c_numKeyBits * c_numKeyBits * c_numKeyBits;

#define Assert(x) if (!(x)) ((int*)nullptr)[0] = 0;

//=================================================================================
// TODO: Replace with something crypto secure if desired!
uint64 RandomUint64 (uint64 min, uint64 max)
{
    static std::random_device rd;
    static std::mt19937 gen(rd());
    std::uniform_int_distribution<uint64> dis(min, max);
    return dis(gen);
}

//=================================================================================
void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}

//=================================================================================
uint64 GenerateKey ()
{
    // Generate an odd random number in [2^(N-1), 2^N)
    // N is the number of bits in our key
    // The key also defines the maximum amount of error allowed, and thus the number
    // of operations allowed in a row.
    uint64 key = RandomUint64(0, (uint64(1) << uint64(c_numKeyBits)) - 1);
    key = key | (uint64(1) << uint64(c_numKeyBits - 1));
    key = key | 1;
    return key;
}

//=================================================================================
bool Decrypt (uint64 key, uint64 value)
{
    return ((value % key) % 2) == 1;
}

//=================================================================================
uint64 Encrypt (uint64 key, bool value)
{
    uint64 keyMultiplier = RandomUint64(0, (1 << c_numMultiplierBits) - 2) + 1;
    uint64 noise = RandomUint64(0, (1 << c_numNoiseBits) - 1);
    uint64 ret = key * keyMultiplier + 2 * noise + (value ? 1 : 0);
    Assert(Decrypt(key, ret) == value);
    return ret;
}

//=================================================================================
uint64 XOR (uint64 A, uint64 B)
{
    return A + B;
}

//=================================================================================
uint64 AND (uint64 A, uint64 B)
{
    return A * B;
}

//=================================================================================
float GetErrorPercent (uint64 key, uint64 value)
{
    // Returns what % of maximum error this value has in it.  When error >= 100%
    // then we have hit our limit and start getting wrong answers.
    return 100.0f * float(value % key) / float(key);
}

//=================================================================================
uint64 FullAdder (uint64 A, uint64 B, uint64 &carryBit)
{
    // homomorphically add the encrypted bits A and B
    // return the single bit sum, and put the carry bit into carryBit
    // From http://en.wikipedia.org/w/index.php?title=Adder_(electronics)&oldid=381607326#Full_adder
    uint64 sumBit = XOR(XOR(A, B), carryBit);
    carryBit = XOR(AND(A, B), AND(carryBit, XOR(A, B)));
    return sumBit;
}

//=================================================================================
int main (int argc, char **argv)
{
    // run this test a bunch to show that it works.  If you get a divide by zero
    // in an Assert, that means that it failed, and hopefully it's because you
    // increased c_numKeyBits to be too large!
    printf("Verifying 10000 truth tables.  Details of first one:n");
    for (int index = 0; index < 10000; ++index)
    {
        // make our key and a true and false bit
        uint64 key = GenerateKey();
        uint64 falseBit1 = Encrypt(key, false);
        uint64 falseBit2 = Encrypt(key, false);
        uint64 trueBit1  = Encrypt(key, true);
        uint64 trueBit2  = Encrypt(key, true);

        // report the results for the first iteration of the loop
        if (index == 0)
        {
            printf("Key 0x%" PRIx64 ", false = 0x%" PRIx64 ", 0x%" PRIx64 " true = 0x%" PRIx64 " 0x%" PRIx64 "n", key, falseBit1, falseBit2, trueBit1, trueBit2);
            printf("  [0 xor 0] = 0   0x%" PRIx64 " xor(+) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%0.2f%%)n", falseBit1, falseBit2, XOR(falseBit1, falseBit2), Decrypt(key, XOR(falseBit1, falseBit2)) ? 1 : 0, GetErrorPercent(key, XOR(falseBit1, falseBit2)));
            printf("  [0 xor 1] = 1   0x%" PRIx64 " xor(+) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%0.2f%%)n", falseBit1, trueBit2 , XOR(falseBit1, trueBit2 ), Decrypt(key, XOR(falseBit1, trueBit2 )) ? 1 : 0, GetErrorPercent(key, XOR(falseBit1, trueBit2 )));
            printf("  [1 xor 0] = 1   0x%" PRIx64 " xor(+) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%0.2f%%)n", trueBit1 , falseBit2, XOR(trueBit1 , falseBit2), Decrypt(key, XOR(trueBit1 , falseBit2)) ? 1 : 0, GetErrorPercent(key, XOR(trueBit1 , falseBit2)));
            printf("  [1 xor 1] = 0   0x%" PRIx64 " xor(+) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%0.2f%%)n", trueBit1 , trueBit2 , XOR(trueBit1 , trueBit2 ), Decrypt(key, XOR(trueBit1 , trueBit2 )) ? 1 : 0, GetErrorPercent(key, XOR(trueBit1 , trueBit2 )));
            printf("  [0 and 0] = 0   0x%" PRIx64 " and(*) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%0.2f%%)n", falseBit1, falseBit2, AND(falseBit1, falseBit2), Decrypt(key, AND(falseBit1, falseBit2)) ? 1 : 0, GetErrorPercent(key, XOR(falseBit1, falseBit2)));
            printf("  [0 and 1] = 0   0x%" PRIx64 " and(*) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%0.2f%%)n", falseBit1, trueBit2 , AND(falseBit1, trueBit2 ), Decrypt(key, AND(falseBit1, trueBit2 )) ? 1 : 0, GetErrorPercent(key, XOR(falseBit1, trueBit2 )));
            printf("  [1 and 0] = 0   0x%" PRIx64 " and(*) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%0.2f%%)n", trueBit1 , falseBit2, AND(trueBit1 , falseBit2), Decrypt(key, AND(trueBit1 , falseBit2)) ? 1 : 0, GetErrorPercent(key, XOR(trueBit1 , falseBit2)));
            printf("  [1 and 1] = 1   0x%" PRIx64 " and(*) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%0.2f%%)n", trueBit1 , trueBit2 , AND(trueBit1 , trueBit2 ), Decrypt(key, AND(trueBit1 , trueBit2 )) ? 1 : 0, GetErrorPercent(key, XOR(trueBit1 , trueBit2 )));
        }

        // Verify truth tables for XOR and AND
        Assert(Decrypt(key, XOR(falseBit1, falseBit2)) == false);
        Assert(Decrypt(key, XOR(falseBit1, trueBit2 )) == true );
        Assert(Decrypt(key, XOR(trueBit1 , falseBit2)) == true );
        Assert(Decrypt(key, XOR(trueBit1 , trueBit2 )) == false);

        Assert(Decrypt(key, AND(falseBit1, falseBit2)) == false);
        Assert(Decrypt(key, AND(falseBit1, trueBit2 )) == false);
        Assert(Decrypt(key, AND(trueBit1 , falseBit2)) == false);
        Assert(Decrypt(key, AND(trueBit1 , trueBit2 )) == true );
    }

    // Do multi bit addition as an example of using compound circuits to
    // do meaningful work.
    const size_t c_numBitsAdded = 3;
    printf("nDoing 10000 Multibit Additions.  Details of first one:n");
    std::array<uint64, c_numBitsAdded> numberAEncrypted;
    std::array<uint64, c_numBitsAdded> numberBEncrypted;
    std::array<uint64, c_numBitsAdded> resultEncrypted;
    std::array<uint64, c_numBitsAdded> carryEncrypted;
    for (int index = 0; index < 10000; ++index)
    {
        // generate the numbers we want to add
        uint64 numberA = RandomUint64(0, (1 << c_numBitsAdded) - 1);
        uint64 numberB = RandomUint64(0, (1 << c_numBitsAdded) - 1);

        // generate our key
        uint64 key = GenerateKey();

        // encrypt our bits
        for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
        {
            numberAEncrypted[bitIndex] = Encrypt(key, (numberA & (uint64(1) << uint64(bitIndex))) != 0);
            numberBEncrypted[bitIndex] = Encrypt(key, (numberB & (uint64(1) << uint64(bitIndex))) != 0);
        }

        // do our multi bit addition!
        // we could initialize the carry bit to 0 or the encrypted value of 0. either one works since 0 and 1
        // are also poor encryptions of 0 and 1 in this scheme!
        uint64 carryBit = Encrypt(key, false);
        for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
        {
            carryEncrypted[bitIndex] = carryBit;
            resultEncrypted[bitIndex] = FullAdder(numberAEncrypted[bitIndex], numberBEncrypted[bitIndex], carryBit);
        }

        // decrypt our result
        uint64 resultDecrypted = 0;
        for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
        {
            if (Decrypt(key, resultEncrypted[bitIndex]))
                resultDecrypted |= uint64(1) << uint64(bitIndex);
        }

        // report the results for the first iteration of the loop
        if (index == 0)
        {
            printf("Key 0x%" PRIx64 ", %" PRId64 " + %" PRId64 " in %i bits = %" PRId64 "n", key, numberA, numberB, c_numBitsAdded, (numberA + numberB) % (1 << c_numBitsAdded));
            for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
                printf("  A[%i] = 0x%" PRIx64 " (%i err=%0.2f%%)n", bitIndex, numberAEncrypted[bitIndex], Decrypt(key, numberAEncrypted[bitIndex]), GetErrorPercent(key, numberAEncrypted[bitIndex]));
            printf("+n");
            for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
                printf("  B[%i] = 0x%" PRIx64 " (%i err=%0.2f%%)n", bitIndex, numberBEncrypted[bitIndex], Decrypt(key, numberBEncrypted[bitIndex]), GetErrorPercent(key, numberBEncrypted[bitIndex]));
            printf("=n");
            for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
                printf("  Result[%i] = 0x%" PRIx64 " (%i err=%0.2f%%)n", bitIndex, resultEncrypted[bitIndex], Decrypt(key, resultEncrypted[bitIndex]), GetErrorPercent(key, resultEncrypted[bitIndex]));
            printf("Carry Bits =n");
            for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
                printf("  Result[%i] = 0x%" PRIx64 " (%i err=%0.2f%%)n", bitIndex, carryEncrypted[bitIndex], Decrypt(key, carryEncrypted[bitIndex]), GetErrorPercent(key, carryEncrypted[bitIndex]));
            printf("result decrypted = %" PRId64 "n", resultDecrypted);
        }

        // make sure that the results match, keeping in mind that the 4 bit encryption may have rolled over
        Assert(resultDecrypted == ((numberA + numberB) % (1 << c_numBitsAdded)));
    }

    WaitForEnter();
    return 0;
}

Here’s the output of an example run.

Links

Another paper about symmetric HE over the integers: Symmetric Somewhat Homomorphic Encryption over the Integers

An implementation of FHE: Implementation of the DGHV fully homomorphic encryption scheme

Next Up

Here are some interesting avenues to explore with this stuff. More blog posts coming in the future about these topics, but for now, here they are in case you want to check them out yourself:

  • Making the public / private key implementation
  • Make this stuff work with multi precision math to be able to use realistically sized keys and numbers, and see what sort of perf it gives
  • Show how to achieve fully homomorphic encryption using bootstrapping and/or modulus switching
  • Further explore tuning down security parameters to get game quality HE
  • Explore the other known methods for implementing HE. HE over the integers is apparently very easy to understand compared to other methods, but other methods may have different characteristics – like maybe not being crazy gigantic

Super Simple Symmetric Leveled Homomorphic Encryption Implementation

Homomorphic encryption is a pretty interesting thing. It allows you to do calculations on encrypted data such that when you decrypt the results, it’s as if you did the calculations on the unencrypted data. This allows computation to happen without the person doing the computation knowing what the data actually is!

Brief History

For a long time, cryptographers wondered if fully homomorphic encryption was even possible. There were various encryption algorithms that could perform SOME operations homomorphically (RSA can do multiplication for instance!), but there weren’t any that could do ALL operations. In other words, you couldn’t execute arbitrary computations.

Those types of algorithms are called “Partially Homomorphic Encryption” or PHE.

Another problem standing in the way of fully homomorphic encryption was that many algorithms would only have a limited count of operations they could perform before error would accumulate and they would start giving incorrect answers. In essence they were limited to evaluating low degree polynomials.

Those types of algorithms are called “Somewhat Homomorphic Encryption” or SWHE.

In contrast, Fully Homomorphic Encryption (FHE) can perform an unlimited number of homomorphic operations, and it can perform any operation homomorphically. It is unbounded in both ways.

Amazingly, in 2009 Craig Gentry figured out the first fully homomorphic encryption scheme! With his setup, you can calculate both XOR and AND on encrypted bits, which makes it Turing complete. It is also able to keep errors from becoming too large by using an ingenious bootstrapping technique to decrease accumulated error. Here’s a link to his PhD thesis: A Fully Homomorphic Encryption Scheme.

Unfortunately, the current implementations of secure FHE take too much computational power to be practical in most situations – like 30 minutes to calculate an AND between 2 bits!

In this post I’m going to show you a super simple HE implementation that will be very easy to understand. It won’t be fully homomorphic, but it will be “leveled” (or, somewhat homomorphic), meaning it is Turing complete, but the count of calculations you can perform are limited due to error creeping in. It also won’t be secure – due to making it easy to understand – but it will be lightning fast.

This will be a symmetric key algorithm, but as we’ll explore in future posts, it can also be used for public key algorithms.

Why Is HE Useful?

One thing you could do with HE is store your financial transactions encrypted on a server. The server could run queries and calculations on your financial data and send back the results. You could then decrypt the result and see what the values are, even though the server itself – which generated the values – has no idea what the numbers actually are.

Another use could be in games. Whether you are playing a first person shooter, or a real time strategy game, many different types of games send information about each player to every other player in the game. Hashes of game state can be used to make sure that everyone is in agreement about calculations to prevent a player from cheating by WRITING to a value they shouldn’t be writing to (or, at least you can detect when they do, and use majority rule to boot them out of the game), but how do you stop a player from READING a value they shouldn’t be reading?

Using HE, you could encrypt the data you need to send to players that they shouldn’t be able to read. With this, they could still do game play logic calculations on the data, and calculate hashes of the encrypted results to ensure that all players were in agreement, but with HE, they wouldn’t gain knowledge of the data they were working with.

In other words, player A could verify that player B’s state is correct and they haven’t cheated, without player A getting details about player B’s state.

In theory this could eliminate or at least help combat things like wall hacks and other “data read” based cheats. In practice there would be some complications to work out, even if it wasn’t crazy slow to calculate, but the fact that there is a path to addressing these issues is pretty exciting! People are working on improving speed, and games don’t need the same level of security that other usage cases do.

How To Do It

Here are the details of this super simple leveled homomorphic symmetric key algorithm.

By the way, all the percent signs below mean “modulus” which is just the remainder of a division. 25 % 4 = 1 for instance, because 25/4 = 6 with a remainder of 1. That remainder of 1 is what we get when we take the modulus. A term you’ll see more often if reading through this stuff on your own will be “residue”. Don’t let that word scare you, it is just another name for the remainder.

Making A Secret Key

To make a key, generate an odd random number between 2^(N-1) and 2^N. In other words, it will be N random bits, except the highest and lowest bit will be set to 1. N is the size of your secret key. Larger keys are more secure, and allow more computations to be done in a row, but they also take more storage space. If you are using a fixed size int – like say a uint32 – a larger key will make you run out of those 32 bits sooner.

key = RandomNumber(0, (1 << N) - 1) | 1 | (1 << (N - 1));

Encrypt

To encrypt a bit, the encrypted value is just the key plus the value of the unencrypted bit (0 or 1).

encryptedBit = key + (value ? 1 : 0);

Decrypt

To decrypt a bit, you take the encrypted bit modulo the key, and then modulo 2.

decryptedBit = (encryptedBit % key) % 2;

XOR

To do an XOR of two encrypted bits, you just add the two values together.

xorResult = encryptedBit1 + encryptedBit2;

AND

To do an AND of two encrypted bits, you just multiply the two values together.

andResult = encryptedBit1 * encryptedBit2;

Example

Let’s run through an example to see this in action.

We’ll use a 4 bit key, and say that the key is 13 (1101 in binary).

Let’s encrypt some bits:

trueBitEncrypted = key + 1 = 13 + 1 = 14
falseBitEncrypted = key + 0 = 13 + 0 = 13

Let’s do some logical operations:

Xor00 = falseBitEncrypted + falseBitEncrypted = 13 + 13 = 26
Xor01 = falseBitEncrypted + trueBitEncrypted  = 13 + 14 = 27
Xor10 = trueBitEncrypted  + falseBitEncrypted = 14 + 13 = 27
Xor11 = trueBitEncrypted  + trueBitEncrypted  = 14 + 14 = 28

And00 = falseBitEncrypted * falseBitEncrypted = 13 * 13 = 169
And01 = falseBitEncrypted * trueBitEncrypted  = 13 * 14 = 182
And10 = trueBitEncrypted  * falseBitEncrypted = 14 * 13 = 182
And11 = trueBitEncrypted  * trueBitEncrypted  = 14 * 14 = 196

FalseXorFalseAndTrue = falseBitEncrypted + falseBitEncrypted * trueBitEncrypted = 13 + 13 * 14 = 195

Notice how AND is a multiplication where XOR is an addition, and that the result of an AND operation is a larger number than an XOR operation. This means that if you are working with a specific sized number (again, such as a uint32), you can do fewer ANDs than XORs before you run out of bits. When you run out of bits and your number overflows, you have hit the ceiling of this leveled HE scheme. That means that ANDs are more expensive than XORs when considering the number of computations you can do.

Ok, time to decrypt our XOR values!

Xor00Decrypted = (Xor00 % key) % 2 = (26 % 13) % 2 = 0
Xor01Decrypted = (Xor01 % key) % 2 = (27 % 13) % 2 = 1
Xor10Decrypted = (Xor10 % key) % 2 = (27 % 13) % 2 = 1
Xor11Decrypted = (Xor11 % key) % 2 = (28 % 13) % 2 = 0

XOR is looking correct, how about AND?

And00Decrypted = (And00 % key) % 2 = (169 % 13) % 2 = 0
And01Decrypted = (And01 % key) % 2 = (182 % 13) % 2 = 0
And10Decrypted = (And10 % key) % 2 = (182 % 13) % 2 = 0
And11Decrypted = (And11 % key) % 2 = (196 % 13) % 2 = 1

AND is looking good as well. Lastly let’s decrypt the compound operation:

FalseXorFalseAndTrueDecrypted = (FalseXorFalseAndTrue % key) % 2 = (195 % 13) % 2 = 0

Lookin good!
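
If you want to double check those numbers yourself, here’s a tiny standalone program of my own (not from the scheme’s paper) that re-does the worked example above with the toy key of 13:

#include <stdio.h>
#include <assert.h>

int main()
{
    const int key = 13;            // 1101 in binary, our 4 bit toy key
    const int falseBit = key + 0;  // 13
    const int trueBit  = key + 1;  // 14

    // XOR is addition, AND is multiplication
    assert(((falseBit + trueBit) % key) % 2 == 1);  // 0 xor 1 = 1
    assert(((trueBit  + trueBit) % key) % 2 == 0);  // 1 xor 1 = 0
    assert(((falseBit * trueBit) % key) % 2 == 0);  // 0 and 1 = 0
    assert(((trueBit  * trueBit) % key) % 2 == 1);  // 1 and 1 = 1

    // the compound operation: false xor (false and true) = 0
    const int compound = falseBit + falseBit * trueBit;  // 13 + 182 = 195
    assert(((compound % key) % 2) == 0);

    printf("all checks passed\n");
    return 0;
}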

Intuition

Let’s get some intuition for why this works…

Key Generation

First up, why is it that the key needs to have its high bit set? Well, on one hand, larger keys are more secure and allow more room for error accumulation, so allow more operations to be done. On the other hand, this is kind of misleading to say. If you generate ANY random odd integer, there will be a highest bit set to 1 SOMEWHERE. You technically don’t need to store the zeros above that. So I guess you could look at it like you are just generating ANY random odd integer, and you could figure out N FROM that value (the position of the highest bit). Thinking about it the way we do, though, lets us specify how many bits we actually want to commit to for the key, which gives us more consistent behavior, an upper bound on storage space, etc.

Secondly, why does the key need to be odd?

Let’s say that you have two numbers A and B where A represents an encrypted bit and B represents the encryption key. If B is even, then A % B will always have the same parity (whether it’s even or odd) as A. Since we are trying to hide whether our encrypted bit is 0 or 1 (even or odd), that makes it very bad encryption since you can recover the plain text bit by doing encryptedValue % 2. If on the other hand, B is odd, A % B will have the same parity as A only if A / B is even.

This doesn’t really make much of a difference in the scheme in this post, because A / B will always be 1 (since the encrypted bit is the key plus the plain text bit), but in the next scheme it is more important because A / B will be a random number, which means that it will be random with a 50/50 chance whether or not the parity of the encrypted bit matches the parity of the plain text bit. Since it’s an even chance whether it matches or not, that means that an attacker can’t use that information to their advantage.

While it’s true that when generating a random key, there is a 50/50 chance of whether you will get an even or odd key, you can see how we’d be in a situation where 75% of the time the parity of the ciphertext would match the parity of the plaintext if we allowed both even and odd keys.

That would mean that while an attacker couldn’t know for CERTAIN whether an encrypted bit is 1 or 0 based on the cipher text, they can guess with 75% confidence that the unencrypted bit will just be the cipher text % 2, which is no good! So, we are better off sticking with an odd numbered key in this scheme. But again, that won’t really matter until the next post!
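
To make that parity argument concrete, here’s a quick throwaway program of my own that tests the claim: with an even modulus, A % B always has the same parity as A, but with an odd modulus, the parity matches only when A / B is even.

#include <stdio.h>

int main()
{
    // B = 8 is an even modulus, B = 9 is an odd one
    for (int B = 8; B <= 9; ++B)
    {
        printf("B = %i:\n", B);
        for (int A = 9; A <= 18; ++A)
        {
            bool parityMatches = ((A % B) % 2) == (A % 2);
            printf("  A=%2i  A/B=%i  parity matches: %s\n",
                A, A / B, parityMatches ? "yes" : "no");
        }
    }
    return 0;
}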

XOR as Addition

I know that I’m going to butcher this explanation a bit in the eyes of someone who knows this math stuff better than me. If you are reading this and see that I have indeed done that, please drop me a line or leave a comment and let me know what I’ve missed or could explain better. I suspect there’s something about rings going on here (;

Believe it or not, when you add two numbers together and then take the modulus, you get the same answer as if you did the modulus on the two numbers, added them together, and then took the modulus again.

In other words, adding two numbers can be seen as adding their residue (remainder).

Let me show you an example.

15 + 28 = 43

((15 % 13) + (28 % 13)) % 13 = 43 % 13
(2 + 2) % 13 = 4
4 = 4

Let’s try another one. I’m picking these numbers “randomly” out of my head 😛

28 + 47 = 75

((28 % 8) + (47 % 8)) % 8 = 75 % 8
(4 + 7) % 8 = 3
3 = 3

OK makes sense, but who cares about that?

Well, believe it or not, 1 bit addition is the same as XOR! This means that you can add two encrypted numbers together, which adds their residues modulo the key together, which in turn adds those residues mod 2, preserving the encrypted parity (odd or even-ness).

Check out this 2 bit binary math. Keep in mind that with 1 bit results, you would only keep the right most binary digit. I’m showing two digits to show you that it is in fact binary addition, and that the right most bit is in fact the same as XOR.

0 + 0 = 00
0 + 1 = 01
1 + 0 = 01
1 + 1 = 10

One thing to note before we move on is that since we are doing a modulus against the key, when the remainder gets too large it rolls over. When it rolls over, we start getting wrong answers and have hit our ceiling of how many operations we can do. So, the encrypted value modulo the key, divided by the key, can be seen as what percentage of the way we are towards our error ceiling.

To avoid hitting the problem of error getting too high too quickly and limiting your calculation count too much you can increase the key size. When you do that you’ll then run out of bits in your fixed size integer storage faster. To avoid THAT problem you can use “multi precision math libraries” to allow your integers to use an arbitrary number of bytes. This is what many real crypto algorithms use when they need to deal with very large numbers.

AND as Multiplication

Similar to the above, when you multiply two numbers and take a modulus of the result, it’s the same as if you took the modulus of the two numbers, multiplied that, and then took the modulus of the result.

In other words, when you multiply two numbers, you can think of it as also multiplying their residue (remainder).

Using the first example numbers from above:

15 * 28 = 420

((15 % 13) * (28 % 13)) % 13 = 420 % 13
(2 * 2) % 13 = 4
4 = 4

And the second:

28 * 47 = 1316

((28 % 8) * (47 % 8)) % 8 = 1316 % 8
(4 * 7) % 8 = 4
4 = 4

A bit of a coincidence that they both worked out to 4 this time 😛

Similar to XOR being the same as 1 bit addition, 1 bit multiplication is actually the same as AND, check it out:

0 * 0 = 0
0 * 1 = 0
1 * 0 = 0
1 * 1 = 1

Since AND multiplies residue, and XOR adds residue, and residue is what limits our homomorphic instruction count, you can see that AND is a more expensive operation compared to XOR, since it eats into our instruction budget a lot faster.

Error In Action

To see why rolling over is a problem, let’s say that our key is 9 and we want to XOR two encrypted bits 8 and 1, which represent 0 and 1 respectively.

To do an XOR, we add them together: 8 + 1 = 9.

Now, when we decrypt it we do this: (9 % 9) % 2 = 0

That result tells us that 0 XOR 1 is 0, which is incorrect! Our residue got too large and we hit the ceiling of our homomorphic instruction budget.

If the first bit was 6 instead of 8, the result of the XOR would have been 7, and (7 % 9) % 2 comes out to 1. That re-affirms to us that if we are under the error budget, we are good to go, but if our residue gets too large, we will have problems!
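
Here’s that rollover in a few lines of code (my own), using the same key of 9:

#include <stdio.h>

int main()
{
    const int key = 9;
    int xorGood = 6 + 1;  // residues 6 and 1: the sum stays under the key
    int xorBad  = 8 + 1;  // residues 8 and 1: the sum rolls over past the key

    printf("good: (%i %% %i) %% 2 = %i\n", xorGood, key, (xorGood % key) % 2);  // prints 1 (correct)
    printf("bad:  (%i %% %i) %% 2 = %i\n", xorBad,  key, (xorBad  % key) % 2);  // prints 0 (wrong!)
    return 0;
}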

Sample Code

// Note that this encryption scheme is insecure so please don't actually use it
// in production!  A false bit with a given key is the same value every time, and
// so is a true bit.  Also, the encrypted true bit value will always be the
// encrypted false bit plus 1.  Even worse, an encrypted false bit is the key itself!
// This is just for demonstration purposes to see how the basics of homomorphic
// encryption work.  The next blog post will increase security.

#include <stdio.h>
#include <stdint.h>
#include <random>
#include <array>
#include <inttypes.h>

typedef uint64_t uint64;

// Increase this value to increase the size of the key, and also the maximum
// size of the error allowed.
// If you set it too high, operations will fail when they run out of storage space
// in the 64 bit ints.  If you set it too low, you will not be able to do very many
// operations in a row.
const size_t c_numKeyBits = 6;

#define Assert(x) if (!(x)) ((int*)nullptr)[0] = 0;

//=================================================================================
// TODO: Replace with something crypto secure if desired!
uint64 RandomUint64 (uint64 min, uint64 max)
{
    static std::random_device rd;
    static std::mt19937 gen(rd());
    std::uniform_int_distribution<uint64> dis(min, max);
    return dis(gen);
}

//=================================================================================
void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}

//=================================================================================
uint64 GenerateKey ()
{
    // Generate an odd random number in [2^(N-1), 2^N)
    // N is the number of bits in our key
    // The key also defines the maximum amount of error allowed, and thus the number
    // of operations allowed in a row.
    return RandomUint64(0, (1 << c_numKeyBits) - 1) | 1 | (1 << (c_numKeyBits - 1));
}

//=================================================================================
bool Decrypt (uint64 key, uint64 value)
{
    return ((value % key) % 2) == 1;
}

//=================================================================================
uint64 Encrypt (uint64 key, bool value)
{
    uint64 ret = key + (value ? 1 : 0);
    Assert(Decrypt(key, ret) == value);
    return ret;
}

//=================================================================================
uint64 XOR (uint64 A, uint64 B)
{
    return A + B;
}

//=================================================================================
uint64 AND (uint64 A, uint64 B)
{
    return A * B;
}

//=================================================================================
int GetErrorPercent (uint64 key, uint64 value)
{
    // Returns what % of maximum error this value has in it.  When error >= 100%
    // then we have hit our limit and start getting wrong answers.
    return int(100.0f * float(value % key) / float(key));
}

//=================================================================================
uint64 FullAdder (uint64 A, uint64 B, uint64 &carryBit)
{
    // homomorphically add the encrypted bits A and B
    // return the single bit sum, and put the carry bit into carryBit
    // From http://en.wikipedia.org/w/index.php?title=Adder_(electronics)&oldid=381607326#Full_adder
    uint64 sumBit = XOR(XOR(A, B), carryBit);
    carryBit = XOR(AND(A, B), AND(carryBit, XOR(A, B)));
    return sumBit;
}

//=================================================================================
int main (int argc, char **argv)
{
    // run this test a bunch to show that it works.  If you get a divide by zero
    // in an Assert, that means that it failed, and hopefully it's because you
    // increased c_numKeyBits to be too large!
    printf("Verifying 10000 truth tables.  Details of first one:n");
    for (int index = 0; index < 10000; ++index)
    {
        // make our key and a true and false bit
        uint64 key = GenerateKey();
        uint64 falseBit = Encrypt(key, false);
        uint64 trueBit = Encrypt(key, true);

        // Verify truth tables for XOR and AND
        Assert(Decrypt(key, XOR(falseBit, falseBit)) == false);
        Assert(Decrypt(key, XOR(falseBit, trueBit )) == true );
        Assert(Decrypt(key, XOR(trueBit , falseBit)) == true );
        Assert(Decrypt(key, XOR(trueBit , trueBit )) == false);

        Assert(Decrypt(key, AND(falseBit, falseBit)) == false);
        Assert(Decrypt(key, AND(falseBit, trueBit )) == false);
        Assert(Decrypt(key, AND(trueBit , falseBit)) == false);
        Assert(Decrypt(key, AND(trueBit , trueBit )) == true );

        // report the results for the first iteration of the loop
        if (index == 0)
        {
            printf("Key 0x%" PRIx64 ", false 0x%" PRIx64 ", true 0x%" PRIx64 "n", key, falseBit, trueBit);
            printf("  [0 xor 0] = 0   0x%" PRIx64 " xor(+) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%i%%)n", falseBit, falseBit, XOR(falseBit, falseBit), Decrypt(key, XOR(falseBit, falseBit)) ? 1 : 0, GetErrorPercent(key, XOR(falseBit, falseBit)));
            printf("  [0 xor 1] = 1   0x%" PRIx64 " xor(+) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%i%%)n", falseBit, trueBit , XOR(falseBit, trueBit ), Decrypt(key, XOR(falseBit, trueBit )) ? 1 : 0, GetErrorPercent(key, XOR(falseBit, trueBit )));
            printf("  [1 xor 0] = 1   0x%" PRIx64 " xor(+) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%i%%)n", trueBit , falseBit, XOR(trueBit , falseBit), Decrypt(key, XOR(trueBit , falseBit)) ? 1 : 0, GetErrorPercent(key, XOR(trueBit , falseBit)));
            printf("  [1 xor 1] = 0   0x%" PRIx64 " xor(+) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%i%%)n", trueBit , trueBit , XOR(trueBit , trueBit ), Decrypt(key, XOR(trueBit , trueBit )) ? 1 : 0, GetErrorPercent(key, XOR(trueBit , trueBit )));
            printf("  [0 and 0] = 0   0x%" PRIx64 " and(*) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%i%%)n", falseBit, falseBit, AND(falseBit, falseBit), Decrypt(key, AND(falseBit, falseBit)) ? 1 : 0, GetErrorPercent(key, XOR(falseBit, falseBit)));
            printf("  [0 and 1] = 0   0x%" PRIx64 " and(*) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%i%%)n", falseBit, trueBit , AND(falseBit, trueBit ), Decrypt(key, AND(falseBit, trueBit )) ? 1 : 0, GetErrorPercent(key, XOR(falseBit, trueBit )));
            printf("  [1 and 0] = 0   0x%" PRIx64 " and(*) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%i%%)n", trueBit , falseBit, AND(trueBit , falseBit), Decrypt(key, AND(trueBit , falseBit)) ? 1 : 0, GetErrorPercent(key, XOR(trueBit , falseBit)));
            printf("  [1 and 1] = 1   0x%" PRIx64 " and(*) 0x%" PRIx64 " = 0x%" PRIx64 " (%i err=%i%%)n", trueBit , trueBit , AND(trueBit , trueBit ), Decrypt(key, AND(trueBit , trueBit )) ? 1 : 0, GetErrorPercent(key, XOR(trueBit , trueBit )));
        }
    }

    // Do multi bit addition as an example of using compound circuits to
    // do meaningful work.
    const size_t c_numBitsAdded = 5;
    printf("nDoing 10000 Multibit Additions.  Details of first one:n");
    std::array<uint64, c_numBitsAdded> numberAEncrypted;
    std::array<uint64, c_numBitsAdded> numberBEncrypted;
    std::array<uint64, c_numBitsAdded> resultEncrypted;
    for (int index = 0; index < 10000; ++index)
    {
        // generate the numbers we want to add
        uint64 numberA = RandomUint64(0, (1 << c_numBitsAdded) - 1);
        uint64 numberB = RandomUint64(0, (1 << c_numBitsAdded) - 1);

        // generate our key
        uint64 key = GenerateKey();

        // encrypt our bits
        for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
        {
            numberAEncrypted[bitIndex] = Encrypt(key, (numberA & (uint64(1) << uint64(bitIndex))) != 0);
            numberBEncrypted[bitIndex] = Encrypt(key, (numberB & (uint64(1) << uint64(bitIndex))) != 0);
        }

        // do our multi bit addition!
        // we could initialize the carry bit to 0 or the encrypted value of 0. either one works since 0 and 1
        // are also poor encryptions of 0 and 1 in this scheme!
        uint64 carryBit = Encrypt(key, false);
        for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
            resultEncrypted[bitIndex] = FullAdder(numberAEncrypted[bitIndex], numberBEncrypted[bitIndex], carryBit);

        // decrypt our result
        uint64 resultDecrypted = 0;
        for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
        {
            if (Decrypt(key, resultEncrypted[bitIndex]))
                resultDecrypted |= uint64(1) << uint64(bitIndex);
        }

        // make sure that the results match, keeping in mind that the 4 bit encryption may have rolled over
        Assert(resultDecrypted == ((numberA + numberB) % (1 << c_numBitsAdded)));

        // report the results for the first iteration of the loop
        if (index == 0)
        {
            printf("Key 0x%" PRIx64 ", %" PRId64 " + %" PRId64 " in %i bits = %" PRId64 "n", key, numberA, numberB, c_numBitsAdded, (numberA + numberB) % (1 << c_numBitsAdded));
            for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
                printf("  A[%i] = 0x%" PRIx64 " (%i err=%i%%)n", bitIndex, numberAEncrypted[bitIndex], Decrypt(key, numberAEncrypted[bitIndex]), GetErrorPercent(key, numberAEncrypted[bitIndex]));
            printf("+n");
            for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
                printf("  B[%i] = 0x%" PRIx64 " (%i err=%i%%)n", bitIndex, numberBEncrypted[bitIndex], Decrypt(key, numberBEncrypted[bitIndex]), GetErrorPercent(key, numberBEncrypted[bitIndex]));
            printf("=n");
            for (int bitIndex = 0; bitIndex < c_numBitsAdded; ++bitIndex)
                printf("  Result[%i] = 0x%" PRIx64 " (%i err=%i%%)n", bitIndex, resultEncrypted[bitIndex], Decrypt(key, resultEncrypted[bitIndex]), GetErrorPercent(key, resultEncrypted[bitIndex]));
            printf("result decrypted = %" PRId64 "n", resultDecrypted);
        }
    }

    WaitForEnter();
    return 0;
}

Here is the output of a run of the program:

What if I Need Constants?!

If you are thinking how you might actually use this code in a real setting, you might be thinking to yourself “it’s great to be able to multiply two encrypted numbers together, but what if I just need to multiply them by a constant like 43?”

Well, interestingly, you can literally just use 0 and 1 in this scheme as constants to perform operations against the encrypted bits.

The reason that this works is that you can see 0 and 1 as just very poor encryptions 😛

(0 % KEY) % 2 = 0
(1 % KEY) % 2 = 1

As long as KEY is >= 2, the above is always true, no matter what the key actually is!

So there you go, add your own constants into the calculations all you want. They also happen to have very low residue/error (actually, they have the least amount possible!), so are much more friendly to use, versus having someone provide you with an encrypted table of constants to use in your calculations. It’s also more secure for the person doing the encrypting for them to provide you less encrypted data that you know the plain text for. It limits your (and anyone else’s) ability to do a known plain text attack.
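
As a quick sketch of my own (reusing the toy key of 13 from the example above), here’s what using plain 0 and 1 as constants looks like. Note that XOR with the constant 1 gives you a homomorphic NOT:

#include <stdio.h>
#include <stdint.h>

int main()
{
    const uint64_t key = 13;           // same toy key as the example
    const uint64_t trueBit = key + 1;  // encrypted 1

    // XOR with the constant 1 acts as a homomorphic NOT
    uint64_t notTrue = trueBit + 1;
    printf("NOT 1   = %i\n", int((notTrue % key) % 2));        // prints 0

    // AND with the constants 0 and 1 works as expected too
    printf("1 AND 0 = %i\n", int(((trueBit * 0) % key) % 2));  // prints 0
    printf("1 AND 1 = %i\n", int(((trueBit * 1) % key) % 2));  // prints 1
    return 0;
}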

The Other Shoe Drops

You might notice that in our scheme, given the same key, every true bit will be the same value, and every false bit will be the same value. Unfortunately, the true bit is also always the false bit + 1. As an attacker, this means that once you have seen both a true bit and a false bit, you will then have broken the encryption.

Even worse, when you encrypt a false bit, it gives you back the key itself!
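
To drive the point home, here’s a little sketch (again my own) of how an attacker who has seen a single known false bit can decrypt any other bit:

#include <stdio.h>
#include <stdint.h>

int main()
{
    const uint64_t key = 43;              // the secret key (odd, high bit of 6 bits set)
    const uint64_t knownFalse = key + 0;  // attacker knows this ciphertext is a 0
    const uint64_t mystery    = key + 1;  // some other intercepted bit

    // the known false bit IS the key, so the attacker can decrypt anything
    uint64_t recoveredKey = knownFalse;
    printf("mystery bit decrypts to %i\n", int((mystery % recoveredKey) % 2));  // prints 1
    return 0;
}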

We’ll improve that in the next post by adding a few more simple operations to the encryption process.

This leveled HE encryption scheme comes directly from the paper below. If you want to give it a look, what we covered is only part of the first two pages!
Fully Homomorphic Encryption over the Integers

The links below are where I started reading up on HE. They go a different route with FHE that you might find interesting, and also have a lot more commentary about the usage cases of HE:
The Swiss Army Knife of Cryptography
Building the Swiss Army Knife

In the scheme in those links, I haven’t figured out how multiplication is supposed to work yet (or bootstrapping, but one thing at a time). If you figure it out, let me know!

Gaussian Blur

In this post we are going to take the concepts we went over in the last post (Box Blur) and apply them to Gaussian blurring.

At a high level, Gaussian blurring works just like box blurring in that there is a weight per pixel and that for each pixel, you apply the weights to that pixel and its neighbors to come up with the final value for the blurred pixel.

With true Gaussian blurring however, the function that defines the weights for each pixel technically never reaches zero, but gets smaller and smaller over distance. In theory this makes a Gaussian kernel infinitely large. In practice though, you can choose a cut off point and call it good enough.

The parameters to a Gaussian blur are:

  • Sigma (\sigma) – This defines how much blur there is. A larger number is a higher amount of blur.
  • Radius – The size of the kernel in pixels. The appropriate pixel size can be calculated for a specific sigma; more on that below.

Just like a box blur, a Gaussian blur is separable which means that you can either apply a 2d convolution kernel, or you can apply a 1d convolution kernel on each axis. Doing a single 2d convolution means more calculations, but you only need one buffer to put the results into. Doing two 1d convolutions (one on each axis), ends up being fewer calculations, but requires two buffers to put the results into (one intermediate buffer to hold the first axis results).

Here is a 3 pixel 1d Gaussian Kernel for a sigma of 1.0:

Below is a 3×3 pixel 2d Gaussian Kernel also with a sigma of 1.0. Note that this can be calculated as an outer product (tensor product) of 1d kernels!

An interesting property of Gaussian blurs is that you can apply multiple smaller blurs and end up with the same result as if you had done a single larger blur. Unfortunately, doing multiple smaller blurs takes more calculations than doing the single larger blur, so it is not usually worthwhile.

If you apply multiple blurs, the equivalent blur is the square root of the sum of the squares of the blurs. Taking wikipedia’s example, if you applied a blur with radius 6 and a blur with radius 8, you’d end up with the equivalent of a radius 10 blur. This is because \sqrt{6^2 + 8^2} = 10.
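
As a quick helper sketch (my own function, not part of the sample code below), the combination rule looks like this:

#include <cmath>

// sigma of the single blur equivalent to applying sigmaA and then sigmaB
float CombinedSigma (float sigmaA, float sigmaB)
{
    return std::sqrt(sigmaA * sigmaA + sigmaB * sigmaB);
}
// CombinedSigma(6.0f, 8.0f) == 10.0f, matching the example above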

Calculating The Kernel

There are a couple ways to calculate a Gaussian kernel.

Believe it or not, Pascal’s triangle approaches the Gaussian bell curve as the row number approaches infinity. If you remember, the rows of Pascal’s triangle are also the coefficients you get when expanding the binomial (x+y)^N. So technically, you could use a row from Pascal’s triangle as a 1d kernel and normalize the result, but it isn’t the most accurate.

A better way is to use the Gaussian function which is this:

e^{-x^2/(2*\sigma^2)}

Where sigma is your blur amount and x ranges across your kernel values from negative to positive. For instance, if your kernel is 5 values wide, x would range from -2 to +2.

An even better way would be to integrate the Gaussian function instead of just taking point samples. You can read about it in the link at the bottom “Gaussian Kernel Calculator”, but it’s also what we do in the example code.

Whatever way you do it, make sure to normalize the result so that the weights add up to 1. This makes sure that your blurring doesn’t make the image get brighter (greater than 1) or dimmer (less than 1).
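
To tie the last few paragraphs together, here’s a minimal point-sampling kernel builder of my own (the sample code below uses the more accurate integrated version instead). It evaluates the Gaussian function at each tap and then normalizes:

#include <math.h>
#include <vector>

std::vector<float> PointSampledGaussianKernel (float sigma, int taps)
{
    std::vector<float> kernel(taps);
    float total = 0.0f;
    for (int i = 0; i < taps; ++i)
    {
        float x = float(i) - float(taps / 2);  // e.g. -2..+2 for 5 taps
        kernel[i] = expf(-(x * x) / (2.0f * sigma * sigma));
        total += kernel[i];
    }
    // normalize so the weights sum to 1
    for (float& weight : kernel)
        weight /= total;
    return kernel;
}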

Calculating The Kernel Size

Given a sigma value, you can calculate the size of the kernel you need by using this formula:

1+2 \sqrt{-2 \sigma^2 \ln 0.005}

That formula makes a kernel large enough that it cuts off when the values in the kernel drop below 0.5%. You can adjust the number in there higher or lower depending on your desires for speed versus quality.

Examples

Once again, here is the unaltered image we are working with:

Here is the image blurred with a sigma of 3,3 (3 on the x axis and 3 on the y axis):

Here is the image blurred with a sigma of 20,3:

Here is the image blurred with a sigma of 50,50:

Shadertoy

Here’s a shadertoy implementing Gaussian Blur: Shadertoy:DF Gaussian Blur

Code

Here’s the source code I used to blur the examples above:

#define _CRT_SECURE_NO_WARNINGS
 
#include <stdio.h>
#include <stdint.h>
#include <array>
#include <vector>
#include <functional>
#include <windows.h>  // for bitmap headers.  Sorry non windows people!
 
typedef uint8_t uint8;

const float c_pi = 3.14159265359f;

struct SImageData
{
    SImageData()
        : m_width(0)
        , m_height(0)
    { }
 
    long m_width;
    long m_height;
    long m_pitch;
    std::vector<uint8> m_pixels;
};
 
void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}
 
bool LoadImage (const char *fileName, SImageData& imageData)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "rb");
    if (!file)
        return false;
 
    // read the headers if we can
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
    if (fread(&header, sizeof(header), 1, file) != 1 ||
        fread(&infoHeader, sizeof(infoHeader), 1, file) != 1 ||
        header.bfType != 0x4D42 || infoHeader.biBitCount != 24)
    {
        fclose(file);
        return false;
    }
 
    // read in our pixel data if we can. Note that it's in BGR order, and each row is padded to the next multiple of 4 bytes
    imageData.m_pixels.resize(infoHeader.biSizeImage);
    fseek(file, header.bfOffBits, SEEK_SET);
    if (fread(&imageData.m_pixels[0], imageData.m_pixels.size(), 1, file) != 1)
    {
        fclose(file);
        return false;
    }
 
    imageData.m_width = infoHeader.biWidth;
    imageData.m_height = infoHeader.biHeight;
 
    imageData.m_pitch = imageData.m_width*3;
    if (imageData.m_pitch & 3)
    {
        imageData.m_pitch &= ~3;
        imageData.m_pitch += 4;
    }
 
    fclose(file);
    return true;
}
 
bool SaveImage (const char *fileName, const SImageData &image)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "wb");
    if (!file)
        return false;
 
    // make the header info
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
 
    header.bfType = 0x4D42;
    header.bfReserved1 = 0;
    header.bfReserved2 = 0;
    header.bfOffBits = 54;
 
    infoHeader.biSize = 40;
    infoHeader.biWidth = image.m_width;
    infoHeader.biHeight = image.m_height;
    infoHeader.biPlanes = 1;
    infoHeader.biBitCount = 24;
    infoHeader.biCompression = 0;
    infoHeader.biSizeImage = image.m_pixels.size();
    infoHeader.biXPelsPerMeter = 0;
    infoHeader.biYPelsPerMeter = 0;
    infoHeader.biClrUsed = 0;
    infoHeader.biClrImportant = 0;
 
    header.bfSize = infoHeader.biSizeImage + header.bfOffBits;
 
    // write the data and close the file
    fwrite(&header, sizeof(header), 1, file);
    fwrite(&infoHeader, sizeof(infoHeader), 1, file);
    fwrite(&image.m_pixels[0], infoHeader.biSizeImage, 1, file);
    fclose(file);
    return true;
}

int PixelsNeededForSigma (float sigma)
{
    // returns the number of pixels needed to represent a gaussian kernel that has values
    // down to the threshold amount.  A gaussian function technically has values everywhere
    // on the image, but the threshold lets us cut it off where the pixels contribute to
    // only small amounts that aren't as noticeable.
    const float c_threshold = 0.005f; // 0.5%
    return int(floor(1.0f + 2.0f * sqrtf(-2.0f * sigma * sigma * log(c_threshold)))) + 1;
}

float Gaussian (float sigma, float x)
{
    return expf(-(x*x) / (2.0f * sigma*sigma));
}

float GaussianSimpsonIntegration (float sigma, float a, float b)
{
    return
        ((b - a) / 6.0f) *
        (Gaussian(sigma, a) + 4.0f * Gaussian(sigma, (a + b) / 2.0f) + Gaussian(sigma, b));
}

std::vector<float> GaussianKernelIntegrals (float sigma, int taps)
{
    std::vector<float> ret;
    float total = 0.0f;
    for (int i = 0; i < taps; ++i)
    {
        float x = float(i) - float(taps / 2);
        float value = GaussianSimpsonIntegration(sigma, x - 0.5f, x + 0.5f);
        ret.push_back(value);
        total += value;
    }
    // normalize it
    for (unsigned int i = 0; i < ret.size(); ++i)
    {
        ret[i] /= total;
    }
    return ret;
}

const uint8* GetPixelOrBlack (const SImageData& image, int x, int y)
{
    static const uint8 black[3] = { 0, 0, 0 };
    if (x < 0 || x >= image.m_width ||
        y < 0 || y >= image.m_height)
    {
        return black;
    }

    return &image.m_pixels[(y * image.m_pitch) + x * 3];
}

void BlurImage (const SImageData& srcImage, SImageData &destImage, float xblursigma, float yblursigma, unsigned int xblursize, unsigned int yblursize)
{
    // allocate space for copying the image for destImage and tmpImage
    destImage.m_width = srcImage.m_width;
    destImage.m_height = srcImage.m_height;
    destImage.m_pitch = srcImage.m_pitch;
    destImage.m_pixels.resize(destImage.m_height * destImage.m_pitch);

    SImageData tmpImage;
    tmpImage.m_width = srcImage.m_width;
    tmpImage.m_height = srcImage.m_height;
    tmpImage.m_pitch = srcImage.m_pitch;
    tmpImage.m_pixels.resize(tmpImage.m_height * tmpImage.m_pitch);

    // horizontal blur from srcImage into tmpImage
    {
        auto row = GaussianKernelIntegrals(xblursigma, xblursize);

        int startOffset = -1 * int(row.size() / 2);

        for (int y = 0; y < tmpImage.m_height; ++y)
        {
            for (int x = 0; x < tmpImage.m_width; ++x)
            {
                std::array<float, 3> blurredPixel = { 0.0f, 0.0f, 0.0f };
                for (unsigned int i = 0; i < row.size(); ++i)
                {
                    const uint8 *pixel = GetPixelOrBlack(srcImage, x + startOffset + i, y);
                    blurredPixel[0] += float(pixel[0]) * row[i];
                    blurredPixel[1] += float(pixel[1]) * row[i];
                    blurredPixel[2] += float(pixel[2]) * row[i];
                }
                
                uint8 *destPixel = &tmpImage.m_pixels[y * tmpImage.m_pitch + x * 3];

                destPixel[0] = uint8(blurredPixel[0]);
                destPixel[1] = uint8(blurredPixel[1]);
                destPixel[2] = uint8(blurredPixel[2]);
            }
        }
    }

    // vertical blur from tmpImage into destImage
    {
        auto row = GaussianKernelIntegrals(yblursigma, yblursize);

        int startOffset = -1 * int(row.size() / 2);

        for (int y = 0; y < destImage.m_height; ++y)
        {
            for (int x = 0; x < destImage.m_width; ++x)
            {
                std::array<float, 3> blurredPixel = { 0.0f, 0.0f, 0.0f };
                for (unsigned int i = 0; i < row.size(); ++i)
                {
                    const uint8 *pixel = GetPixelOrBlack(tmpImage, x, y + startOffset + i);
                    blurredPixel[0] += float(pixel[0]) * row[i];
                    blurredPixel[1] += float(pixel[1]) * row[i];
                    blurredPixel[2] += float(pixel[2]) * row[i];
                }

                uint8 *destPixel = &destImage.m_pixels[y * destImage.m_pitch + x * 3];

                destPixel[0] = uint8(blurredPixel[0]);
                destPixel[1] = uint8(blurredPixel[1]);
                destPixel[2] = uint8(blurredPixel[2]);
            }
        }
    }
}

int main (int argc, char **argv)
{
    float xblursigma, yblursigma;
 
    bool showUsage = argc < 5 ||
        (sscanf(argv[3], "%f", &xblursigma) != 1) ||
        (sscanf(argv[4], "%f", &yblursigma) != 1);
 
    char *srcFileName = argv[1];
    char *destFileName = argv[2];
 
    if (showUsage)
    {
        printf("Usage: <source> <dest> <xblur> <yblur>nBlur values are sigmann");
        WaitForEnter();
        return 1;
    }
    
    // calculate pixel sizes, and make sure they are odd 
    int xblursize = PixelsNeededForSigma(xblursigma) | 1;
    int yblursize = PixelsNeededForSigma(yblursigma) | 1;

    printf("Attempting to blur a 24 bit image.n");
    printf("  Source=%sn  Dest=%sn  blur=[%0.1f, %0.1f] px=[%d,%d]nn", srcFileName, destFileName, xblursigma, yblursigma, xblursize, yblursize);
 
    SImageData srcImage;
    if (LoadImage(srcFileName, srcImage))
    {
        printf("%s loadedn", srcFileName);
        SImageData destImage;
        BlurImage(srcImage, destImage, xblursigma, yblursigma, xblursize, yblursize);
        if (SaveImage(destFileName, destImage))
            printf("Blurred image saved as %sn", destFileName);
        else
        {
            printf("Could not save blurred image as %sn", destFileName);
            WaitForEnter();
            return 1;
        }
    }
    else
    {
        printf("could not read 24 bit bmp file %snn", srcFileName);
        WaitForEnter();
        return 1;
    }
    return 0;
}

Links

Here is a really great explanation of the Gaussian blur.
Gaussian Blur – Image processing for scientists and engineers, Part 4
I highly recommend reading the 6 part series about image processing (DSP) from the beginning because it’s really informative and very easy to read!
Images are data – Image processing for scientists and engineers, Part 1

The Gaussian Kernel
Gaussian Kernel Calculator
DSP Stack Exchange: Gaussian Blur – standard deviation, radius and kernel size
Wikipedia: Gaussian blur

If you want to take this from theory / hobby level up to pro level, give this link a read from intel:
Intel: An investigation of fast real-time GPU-based image blur algorithms

Box Blur

If you ever have heard the terms “Box Blur”, “Boxcar Function”, “Box Filter”, “Boxcar Integrator” or other various combinations of those words, you may have thought it was some advanced concept that is hard to understand and hard to implement. If that’s what you thought, prepare to be surprised!

A box filter is nothing more than taking N samples of data (or NxN samples of data, or NxNxN etc) and averaging them! Yes, that is all there is to it 😛

In this post, we are going to implement a box blur by averaging pixels.

1D Case

For the case of a 1d box filter, let’s say we wanted every data point to be the result of averaging it with its two neighbors. It’d be easy enough to program that directly, but let’s look at it a different way. What weight would we need to multiply each of the three values by (the value and its two neighbors) to make it come up with the average?

Yep, you guessed it! For every data value, you multiply it and its neighbors by 1/3 to come up with the average value. We could easily increase the size of the filter to 5 pixels, and multiply each value by 1/5 instead. We could continue the pattern as high as we wanted.

One thing you might notice is that if we want a buffer with all the results, we can’t just alter the source data as we go, because we want the unaltered source values of the data to use those weights with, to get the correct results. Because of that, we need to make a second buffer to put the results of the filtering into.

Believe it or not, that diagram above is a convolution kernel, and how we talked about applying it is how you do convolution in 1d! It just so happens that this convolution kernel averages three pixels into one, which also happens to provide a low pass filter type effect.
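
Here’s a minimal sketch (my own) of exactly that 1d case: blurring a small signal by averaging each value with its two neighbors, writing the results into a second buffer:

#include <stdio.h>
#include <vector>

int main()
{
    std::vector<float> data = { 1.0f, 2.0f, 3.0f, 10.0f, 3.0f, 2.0f, 1.0f };
    std::vector<float> blurred(data.size());

    for (size_t i = 0; i < data.size(); ++i)
    {
        // clamp at the edges so the ends reuse their own value
        float left  = data[i > 0 ? i - 1 : 0];
        float right = data[i + 1 < data.size() ? i + 1 : data.size() - 1];
        blurred[i] = (left + data[i] + right) / 3.0f;  // the 1/3, 1/3, 1/3 kernel
    }

    for (float value : blurred)
        printf("%0.2f ", value);
    printf("\n");
    return 0;
}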

Low pass filtering is what is done before down sampling audio data to prevent aliasing (frequencies higher than the sample rate can handle, which makes audio sound bad).

Surprise… blurring can also be seen as low pass filtering, which is something you can do before scaling an image down in size, to prevent aliasing.

2D Case

The 2d case isn’t much more difficult to understand than the 1d case. Instead of only averaging on one axis, we average on two instead:

Something interesting to note is that you can either use this 3×3 2d convolution kernel, or, you could apply the 1d convolution kernel described above on the X axis and then the Y axis. The methods are mathematically equivalent.

Using the 2d convolution kernel would result in 9 multiplications per pixel, but if going with the separated X and then Y 1d kernels, you’d only end up doing 6 multiplications per pixel (3 multiplications per axis). In general, if you have a separable 2d convolution kernel (meaning that you can break it into a per axis 1d convolution), you will end up doing N^2 multiplications when using the 2d kernel, versus N*2 multiplications when using the 1d kernels. You can see that this would add up quickly in favor of using 1d kernels, but unfortunately not all kernels are separable.

Doing two passes does come at a cost though. Since you have to use a temporary buffer for each pass, you end up having to create two temporary buffers instead of one.

You can build 2d kernels from 1d kernels by multiplying one as a column vector by the other as a row vector. For instance, you can see how multiplying the (1/3, 1/3, 1/3) kernel by itself this way creates the 2d kernel, which is 3×3 and has 1/9 in every spot.

The resulting 3×3 matrix is called an outer product, or a tensor product. Something interesting to note is that you don’t have to do the same operation on each axis!
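
Here’s a tiny sketch (mine) of that outer product construction, building the 3×3 box kernel from the 1d (1/3, 1/3, 1/3) kernel:

#include <stdio.h>

int main()
{
    const float kernel1d[3] = { 1.0f / 3.0f, 1.0f / 3.0f, 1.0f / 3.0f };
    float kernel2d[3][3];

    // outer product: column vector times row vector
    for (int y = 0; y < 3; ++y)
        for (int x = 0; x < 3; ++x)
            kernel2d[y][x] = kernel1d[y] * kernel1d[x];  // = 1/9 everywhere

    for (int y = 0; y < 3; ++y)
        printf("%0.4f %0.4f %0.4f\n", kernel2d[y][0], kernel2d[y][1], kernel2d[y][2]);
    return 0;
}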

Examples

Here are some examples of box blurring with different values, using the sample code provided below.

The source image:

Now blurred by a 10×10 box car convolution kernel:

Now blurred by a 100×10 box car convolution kernel:

Shadertoy

You can find a shadertoy implementation of box blurring here: Shadertoy:DF Box Blur

Code

Here’s the code I used to blur the example images above:

#define _CRT_SECURE_NO_WARNINGS
  
#include <stdio.h>
#include <stdint.h>
#include <array>
#include <vector>
#include <functional>
#include <windows.h>  // for bitmap headers.  Sorry non windows people!
  
typedef uint8_t uint8;
 
const float c_pi = 3.14159265359f;
 
struct SImageData
{
    SImageData()
        : m_width(0)
        , m_height(0)
    { }
  
    long m_width;
    long m_height;
    long m_pitch;
    std::vector<uint8> m_pixels;
};
  
void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}
  
bool LoadImage (const char *fileName, SImageData& imageData)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "rb");
    if (!file)
        return false;
  
    // read the headers if we can
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
    if (fread(&header, sizeof(header), 1, file) != 1 ||
        fread(&infoHeader, sizeof(infoHeader), 1, file) != 1 ||
        header.bfType != 0x4D42 || infoHeader.biBitCount != 24)
    {
        fclose(file);
        return false;
    }
  
    // read in our pixel data if we can. Note that it's in BGR order, and each row is padded to the next multiple of 4 bytes
    imageData.m_pixels.resize(infoHeader.biSizeImage);
    fseek(file, header.bfOffBits, SEEK_SET);
    if (fread(&imageData.m_pixels[0], imageData.m_pixels.size(), 1, file) != 1)
    {
        fclose(file);
        return false;
    }
  
    imageData.m_width = infoHeader.biWidth;
    imageData.m_height = infoHeader.biHeight;
  
    imageData.m_pitch = imageData.m_width*3;
    if (imageData.m_pitch & 3)
    {
        imageData.m_pitch &= ~3;
        imageData.m_pitch += 4;
    }
  
    fclose(file);
    return true;
}
  
bool SaveImage (const char *fileName, const SImageData &image)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "wb");
    if (!file)
        return false;
  
    // make the header info
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
  
    header.bfType = 0x4D42;
    header.bfReserved1 = 0;
    header.bfReserved2 = 0;
    header.bfOffBits = 54;
  
    infoHeader.biSize = 40;
    infoHeader.biWidth = image.m_width;
    infoHeader.biHeight = image.m_height;
    infoHeader.biPlanes = 1;
    infoHeader.biBitCount = 24;
    infoHeader.biCompression = 0;
    infoHeader.biSizeImage = image.m_pixels.size();
    infoHeader.biXPelsPerMeter = 0;
    infoHeader.biYPelsPerMeter = 0;
    infoHeader.biClrUsed = 0;
    infoHeader.biClrImportant = 0;
  
    header.bfSize = infoHeader.biSizeImage + header.bfOffBits;
  
    // write the data and close the file
    fwrite(&header, sizeof(header), 1, file);
    fwrite(&infoHeader, sizeof(infoHeader), 1, file);
    fwrite(&image.m_pixels[0], infoHeader.biSizeImage, 1, file);
    fclose(file);
    return true;
}

const uint8* GetPixelOrBlack (const SImageData& image, int x, int y)
{
    static const uint8 black[3] = { 0, 0, 0 };
    if (x < 0 || x >= image.m_width ||
        y < 0 || y >= image.m_height)
    {
        return black;
    }
 
    return &image.m_pixels[(y * image.m_pitch) + x * 3];
}
 
void BlurImage (const SImageData& srcImage, SImageData &destImage, unsigned int xblur, unsigned int yblur)
{
    // allocate space for copying the image for destImage and tmpImage
    destImage.m_width = srcImage.m_width;
    destImage.m_height = srcImage.m_height;
    destImage.m_pitch = srcImage.m_pitch;
    destImage.m_pixels.resize(destImage.m_height * destImage.m_pitch);
 
    SImageData tmpImage;
    tmpImage.m_width = srcImage.m_width;
    tmpImage.m_height = srcImage.m_height;
    tmpImage.m_pitch = srcImage.m_pitch;
    tmpImage.m_pixels.resize(tmpImage.m_height * tmpImage.m_pitch);
 
    // horizontal blur from srcImage into tmpImage
    {
        float weight = 1.0f / float(xblur);
        int half = xblur / 2;
        for (int y = 0; y < tmpImage.m_height; ++y)
        {
            for (int x = 0; x < tmpImage.m_width; ++x)
            {
                std::array<float, 3> blurredPixel = { 0.0f, 0.0f, 0.0f };
                for (int i = -half; i <= half; ++i)
                {
                    const uint8 *pixel = GetPixelOrBlack(srcImage, x + i, y);
                    blurredPixel[0] += float(pixel[0]) * weight;
                    blurredPixel[1] += float(pixel[1]) * weight;
                    blurredPixel[2] += float(pixel[2]) * weight;
                }
                 
                uint8 *destPixel = &tmpImage.m_pixels[y * tmpImage.m_pitch + x * 3];
 
                destPixel[0] = uint8(blurredPixel[0]);
                destPixel[1] = uint8(blurredPixel[1]);
                destPixel[2] = uint8(blurredPixel[2]);
            }
        }
    }
 
    // vertical blur from tmpImage into destImage
    {
        float weight = 1.0f / float(yblur);
        int half = yblur / 2;
 
        for (int y = 0; y < destImage.m_height; ++y)
        {
            for (int x = 0; x < destImage.m_width; ++x)
            {
                std::array<float, 3> blurredPixel = { 0.0f, 0.0f, 0.0f };
                for (int i = -half; i <= half; ++i)
                {
                    const uint8 *pixel = GetPixelOrBlack(tmpImage, x, y + i);
                    blurredPixel[0] += float(pixel[0]) * weight;
                    blurredPixel[1] += float(pixel[1]) * weight;
                    blurredPixel[2] += float(pixel[2]) * weight;
                }
 
                uint8 *destPixel = &destImage.m_pixels[y * destImage.m_pitch + x * 3];
 
                destPixel[0] = uint8(blurredPixel[0]);
                destPixel[1] = uint8(blurredPixel[1]);
                destPixel[2] = uint8(blurredPixel[2]);
            }
        }
    }
}
 
int main (int argc, char **argv)
{
    int xblur, yblur;
  
    bool showUsage = argc < 5 ||
        (sscanf(argv[3], "%i", &xblur) != 1) ||
        (sscanf(argv[4], "%i", &yblur) != 1);
  
    char *srcFileName = argv[1];
    char *destFileName = argv[2];
  
    if (showUsage)
    {
        printf("Usage: <source> <dest> <xblur> <yblur>nn");
        WaitForEnter();
        return 1;
    }
     
    // make sure blur size is odd
    xblur = xblur | 1;
    yblur = yblur | 1;
 
    printf("Attempting to blur a 24 bit image.n");
    printf("  Source=%sn  Dest=%sn  blur=[%d,%d]nn", srcFileName, destFileName, xblur, yblur);
  
    SImageData srcImage;
    if (LoadImage(srcFileName, srcImage))
    {
        printf("%s loadedn", srcFileName);
        SImageData destImage;
        BlurImage(srcImage, destImage, xblur, yblur);
        if (SaveImage(destFileName, destImage))
            printf("Blurred image saved as %sn", destFileName);
        else
        {
            printf("Could not save blurred image as %sn", destFileName);
            WaitForEnter();
            return 1;
        }
    }
    else
    {
        printf("could not read 24 bit bmp file %snn", srcFileName);
        WaitForEnter();
        return 1;
    }
    return 0;
}

Next Up

Next up will be a Gaussian blur, and I’m nearly done w/ that post but wanted to make this one first as an introductory step!

Before we get there, I wanted to mention that if you do multiple box blurs in a row, it will start to approach Gaussian blurring. I’ve heard that three blurs in a row will make it basically indistinguishable from a Gaussian blur.

Resizing Images With Bicubic Interpolation

In the last post we saw how to do cubic interpolation on a grid of data.

Strangely enough, when that grid is a grid of pixel data, bicubic interpolation is a common method for resizing images!

Bicubic interpolation can also be used in realtime rendering to make textures look nicer when scaled than standard bilinear texture interpolation.

This technique works when making images larger as well as smaller, but when making images smaller, you can still have problems with aliasing. There are better algorithms to use when making an image smaller. Check the links section at the bottom for more details!

Example

Here’s the old man from The Legend of Zelda who gives you the sword.

Here he is scaled up 4x with nearest neighbor, bilinear interpolation and bicubic interpolation.


Here he is scaled up 16x with nearest neighbor, bilinear interpolation and bicubic interpolation.


Shadertoy

I made a shadertoy to show you how to do this in a GLSL pixel shader as well. Shadertoy: Bicubic Texture Filtering

In the screenshot below, going from left to right it uses: Nearest Neighbor, Bilinear, Lagrange Bicubic interpolation (only interpolates values, not slopes), Hermite Bicubic interpolation.

Sample Code

Here’s the code that I used to resize the images in the examples above.

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdint.h>
#include <array>
#include <vector>
#include <windows.h>  // for bitmap headers.  Sorry non windows people!

#define CLAMP(v, min, max) if (v < min) { v = min; } else if (v > max) { v = max; } 

typedef uint8_t uint8;

struct SImageData
{
    SImageData()
        : m_width(0)
        , m_height(0)
    { }

    long m_width;
    long m_height;
    long m_pitch;
    std::vector<uint8> m_pixels;
};

void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}

bool LoadImage (const char *fileName, SImageData& imageData)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "rb");
    if (!file)
        return false;

    // read the headers if we can
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
    if (fread(&header, sizeof(header), 1, file) != 1 ||
        fread(&infoHeader, sizeof(infoHeader), 1, file) != 1 ||
        header.bfType != 0x4D42 || infoHeader.biBitCount != 24)
    {
        fclose(file);
        return false;
    }

    // read in our pixel data if we can. Note that it's in BGR order, and each row is padded to the next multiple of 4 bytes
    imageData.m_pixels.resize(infoHeader.biSizeImage);
    fseek(file, header.bfOffBits, SEEK_SET);
    if (fread(&imageData.m_pixels[0], imageData.m_pixels.size(), 1, file) != 1)
    {
        fclose(file);
        return false;
    }

    imageData.m_width = infoHeader.biWidth;
    imageData.m_height = infoHeader.biHeight;

    imageData.m_pitch = imageData.m_width*3;
    if (imageData.m_pitch & 3)
    {
        imageData.m_pitch &= ~3;
        imageData.m_pitch += 4;
    }

    fclose(file);
    return true;
}

bool SaveImage (const char *fileName, const SImageData &image)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "wb");
    if (!file)
        return false;

    // make the header info
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;

    header.bfType = 0x4D42;
    header.bfReserved1 = 0;
    header.bfReserved2 = 0;
    header.bfOffBits = 54;

    infoHeader.biSize = 40;
    infoHeader.biWidth = image.m_width;
    infoHeader.biHeight = image.m_height;
    infoHeader.biPlanes = 1;
    infoHeader.biBitCount = 24;
    infoHeader.biCompression = 0;
    infoHeader.biSizeImage = image.m_pixels.size();
    infoHeader.biXPelsPerMeter = 0;
    infoHeader.biYPelsPerMeter = 0;
    infoHeader.biClrUsed = 0;
    infoHeader.biClrImportant = 0;

    header.bfSize = infoHeader.biSizeImage + header.bfOffBits;

    // write the data and close the file
    fwrite(&header, sizeof(header), 1, file);
    fwrite(&infoHeader, sizeof(infoHeader), 1, file);
    fwrite(&image.m_pixels[0], infoHeader.biSizeImage, 1, file);
    fclose(file);
    return true;
}

// t is a value that goes from 0 to 1 to interpolate in a C1 continuous way across uniformly sampled data points.
// when t is 0, this will return B.  When t is 1, this will return C.  In between values will return an interpolation
// between B and C.  A and D are used to calculate the slopes at the edges.
float CubicHermite (float A, float B, float C, float D, float t)
{
    float a = -A / 2.0f + (3.0f*B) / 2.0f - (3.0f*C) / 2.0f + D / 2.0f;
    float b = A - (5.0f*B) / 2.0f + 2.0f*C - D / 2.0f;
    float c = -A / 2.0f + C / 2.0f;
    float d = B;

    return a*t*t*t + b*t*t + c*t + d;
}

float Lerp (float A, float B, float t)
{
    return A * (1.0f - t) + B * t;
}

const uint8* GetPixelClamped (const SImageData& image, int x, int y)
{
    CLAMP(x, 0, image.m_width - 1);
    CLAMP(y, 0, image.m_height - 1);    
    return &image.m_pixels[(y * image.m_pitch) + x * 3];
}

std::array<uint8, 3> SampleNearest (const SImageData& image, float u, float v)
{
    // calculate coordinates
    int xint = int(u * image.m_width);
    int yint = int(v * image.m_height);

    // return pixel
    auto pixel = GetPixelClamped(image, xint, yint);
    std::array<uint8, 3> ret;
    ret[0] = pixel[0];
    ret[1] = pixel[1];
    ret[2] = pixel[2];
    return ret;
}

std::array<uint8, 3> SampleLinear (const SImageData& image, float u, float v)
{
    // calculate coordinates -> also need to offset by half a pixel to keep image from shifting down and left half a pixel
    float x = (u * image.m_width) - 0.5f;
    int xint = int(x);
    float xfract = x - floor(x);

    float y = (v * image.m_height) - 0.5f;
    int yint = int(y);
    float yfract = y - floor(y);

    // get pixels
    auto p00 = GetPixelClamped(image, xint + 0, yint + 0);
    auto p10 = GetPixelClamped(image, xint + 1, yint + 0);
    auto p01 = GetPixelClamped(image, xint + 0, yint + 1);
    auto p11 = GetPixelClamped(image, xint + 1, yint + 1);

    // interpolate bi-linearly!
    std::array<uint8, 3> ret;
    for (int i = 0; i < 3; ++i)
    {
        float col0 = Lerp(p00[i], p10[i], xfract);
        float col1 = Lerp(p01[i], p11[i], xfract);
        float value = Lerp(col0, col1, yfract);
        CLAMP(value, 0.0f, 255.0f);
        ret[i] = uint8(value);
    }
    return ret;
}

std::array<uint8, 3> SampleBicubic (const SImageData& image, float u, float v)
{
    // calculate coordinates -> also need to offset by half a pixel to keep image from shifting down and left half a pixel
    float x = (u * image.m_width) - 0.5f;
    int xint = int(x);
    float xfract = x - floor(x);

    float y = (v * image.m_height) - 0.5f;
    int yint = int(y);
    float yfract = y - floor(y);

    // 1st row
    auto p00 = GetPixelClamped(image, xint - 1, yint - 1);
    auto p10 = GetPixelClamped(image, xint + 0, yint - 1);
    auto p20 = GetPixelClamped(image, xint + 1, yint - 1);
    auto p30 = GetPixelClamped(image, xint + 2, yint - 1);

    // 2nd row
    auto p01 = GetPixelClamped(image, xint - 1, yint + 0);
    auto p11 = GetPixelClamped(image, xint + 0, yint + 0);
    auto p21 = GetPixelClamped(image, xint + 1, yint + 0);
    auto p31 = GetPixelClamped(image, xint + 2, yint + 0);

    // 3rd row
    auto p02 = GetPixelClamped(image, xint - 1, yint + 1);
    auto p12 = GetPixelClamped(image, xint + 0, yint + 1);
    auto p22 = GetPixelClamped(image, xint + 1, yint + 1);
    auto p32 = GetPixelClamped(image, xint + 2, yint + 1);

    // 4th row
    auto p03 = GetPixelClamped(image, xint - 1, yint + 2);
    auto p13 = GetPixelClamped(image, xint + 0, yint + 2);
    auto p23 = GetPixelClamped(image, xint + 1, yint + 2);
    auto p33 = GetPixelClamped(image, xint + 2, yint + 2);

    // interpolate bi-cubically!
    // Clamp the values since the curve can put the value below 0 or above 255
    std::array<uint8, 3> ret;
    for (int i = 0; i < 3; ++i)
    {
        float col0 = CubicHermite(p00[i], p10[i], p20[i], p30[i], xfract);
        float col1 = CubicHermite(p01[i], p11[i], p21[i], p31[i], xfract);
        float col2 = CubicHermite(p02[i], p12[i], p22[i], p32[i], xfract);
        float col3 = CubicHermite(p03[i], p13[i], p23[i], p33[i], xfract);
        float value = CubicHermite(col0, col1, col2, col3, yfract);
        CLAMP(value, 0.0f, 255.0f);
        ret[i] = uint8(value);
    }
    return ret;
}

void ResizeImage (const SImageData &srcImage, SImageData &destImage, float scale, int degree)
{
    destImage.m_width = long(float(srcImage.m_width)*scale);
    destImage.m_height = long(float(srcImage.m_height)*scale);
    destImage.m_pitch = destImage.m_width * 3;
    // BMP scanlines must be a multiple of 4 bytes in size, so round the pitch up
    if (destImage.m_pitch & 3)
    {
        destImage.m_pitch &= ~3;
        destImage.m_pitch += 4;
    }
    destImage.m_pixels.resize(destImage.m_pitch*destImage.m_height);

    uint8 *row = &destImage.m_pixels[0];
    for (int y = 0; y < destImage.m_height; ++y)
    {
        uint8 *destPixel = row;
        float v = float(y) / float(destImage.m_height - 1);
        for (int x = 0; x < destImage.m_width; ++x)
        {
            float u = float(x) / float(destImage.m_width - 1);
            std::array<uint8, 3> sample;

            if (degree == 0)
                sample = SampleNearest(srcImage, u, v);
            else if (degree == 1)
                sample = SampleLinear(srcImage, u, v);
            else if (degree == 2)
                sample = SampleBicubic(srcImage, u, v);

            destPixel[0] = sample[0];
            destPixel[1] = sample[1];
            destPixel[2] = sample[2];
            destPixel += 3;
        }
        row += destImage.m_pitch;
    }
}

int main (int argc, char **argv)
{
    float scale = 1.0f;
    int degree = 0;

    bool showUsage = argc < 5 ||
        (sscanf(argv[3], "%f", &scale) != 1) ||
        (sscanf(argv[4], "%i", &degree) != 1);

    if (showUsage)
    {
        printf("Usage: <source> <dest> <scale> <degree>\ndegree 0 = nearest, 1 = bilinear, 2 = bicubic.\n\n");
        WaitForEnter();
        return 1;
    }

    // only read these after the argument count check, since argv[1] and argv[2]
    // aren't guaranteed to exist until we know argc is large enough
    char *srcFileName = argv[1];
    char *destFileName = argv[2];

    printf("Attempting to resize a 24 bit image.n");
    printf("  Source = %sn  Dest = %sn  Scale = %0.2fnn", srcFileName, destFileName, scale);

    SImageData srcImage;
    if (LoadImage(srcFileName, srcImage))
    {
        printf("%s loadedn", srcFileName);
        SImageData destImage;
        ResizeImage(srcImage, destImage, scale, degree);
        if (SaveImage(destFileName, destImage))
            printf("Resized image saved as %sn", destFileName);
        else
            printf("Could not save resized image as %sn", destFileName);
    }
    else
        printf("could not read 24 bit bmp file %snn", srcFileName);
    return 0;
}

Links

A small tutorial about how to load a bitmap file
The BMP Format
Reconstruction Filters in Computer Graphics

The link below talks about how to do cubic texture sampling on the GPU without having to do 16 texture reads!
GPU Gems 2 Chapter 20. Fast Third-Order Texture Filtering

This link is from Inigo Quilez, where he transforms a texture coordinate before passing it to the bilinear filter, to get higher quality texture sampling without having to do extra texture reads. That is pretty cool; a rough sketch of the idea follows.
IQ: improved texture interpolation
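
To give a feel for the trick, here's my own C++ translation of the idea (IQ's version is GLSL running on the GPU, and the function name SampleLinearSmooth is my invention). It reuses the SImageData and SampleLinear definitions from the resizing code above: the fractional part of the texel coordinate gets warped with a smoothstep before the regular bilinear interpolation happens, which smooths out the derivative discontinuities at texel boundaries.

std::array<uint8, 3> SampleLinearSmooth (const SImageData& image, float u, float v)
{
    // move into texel space, centered on texel centers
    float x = (u * image.m_width) - 0.5f;
    float y = (v * image.m_height) - 0.5f;

    float xwhole = floorf(x);
    float ywhole = floorf(y);
    float xfract = x - xwhole;
    float yfract = y - ywhole;

    // smoothstep the fractional parts: f = f*f*(3-2f)
    xfract = xfract * xfract * (3.0f - 2.0f * xfract);
    yfract = yfract * yfract * (3.0f - 2.0f * yfract);

    // convert back to uv space and let the plain bilinear sampler finish up
    float u2 = (xwhole + xfract + 0.5f) / float(image.m_width);
    float v2 = (ywhole + yfract + 0.5f) / float(image.m_height);
    return SampleLinear(image, u2, v2);
}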

Cubic Hermite Rectangles

Time for another Frankenstein post. This time we are going to combine the following:

  1. Cubic Hermite Interpolation
  2. Rectangular Bezier Patches

The end result is going to be a Cubic Hermite Rectangle Surface like the one below. Note that the surface only passes through the inner four control points; the outer ring of 12 control points is used to determine the slope.

Just like its cubic Hermite curve counterpart, a cubic Hermite rectangle surface is C1 continuous everywhere, which makes it great for modeling geometry, as well as for interpolating multidimensional data. In the image below, each checkerboard square is an individual Hermite rectangle.

The links section at the bottom has links to the shadertoys I made that I got the screenshots from.

Code

Here’s some C++ code that does bicubic Hermite interpolation:

#include <stdio.h>
#include <array>
  
typedef std::array<float, 4> TFloat4;
typedef std::array<TFloat4, 4> TFloat4x4;
  
const TFloat4x4 c_ControlPointsX =
{
    {
        { 0.7f, 0.8f, 0.9f, 0.3f },
        { 0.2f, 0.5f, 0.4f, 0.1f },
        { 0.6f, 0.3f, 0.1f, 0.4f },
        { 0.8f, 0.4f, 0.2f, 0.7f },
    }
};
  
const TFloat4x4 c_ControlPointsY =
{
    {
        { 0.2f, 0.8f, 0.5f, 0.6f },
        { 0.6f, 0.9f, 0.3f, 0.8f },
        { 0.7f, 0.1f, 0.4f, 0.9f },
        { 0.6f, 0.5f, 0.3f, 0.2f },
    }
};
  
const TFloat4x4 c_ControlPointsZ =
{
    {
        { 0.6f, 0.5f, 0.3f, 0.2f },
        { 0.7f, 0.1f, 0.9f, 0.5f },
        { 0.8f, 0.4f, 0.2f, 0.7f },
        { 0.6f, 0.3f, 0.1f, 0.4f },
    }
};
  
void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}
  
// t is a value that goes from 0 to 1 to interpolate in a C1 continuous way across uniformly sampled data points.
// when t is 0, this will return p[1].  When t is 1, this will return p[2].
// p[0] and p[3] are used to calculate slopes at the edges.
float CubicHermite(const TFloat4& p, float t)
{
    float a = -p[0] / 2.0f + (3.0f*p[1]) / 2.0f - (3.0f*p[2]) / 2.0f + p[3] / 2.0f;
    float b = p[0] - (5.0f*p[1]) / 2.0f + 2.0f*p[2] - p[3] / 2.0f;
    float c = -p[0] / 2.0f + p[2] / 2.0f;
    float d = p[1];

    return a*t*t*t + b*t*t + c*t + d;
}
  
float BicubicHermitePatch(const TFloat4x4& p, float u, float v)
{
    TFloat4 uValues;
    uValues[0] = CubicHermite(p[0], u);
    uValues[1] = CubicHermite(p[1], u);
    uValues[2] = CubicHermite(p[2], u);
    uValues[3] = CubicHermite(p[3], u);
    return CubicHermite(uValues, v);
}
  
int main(int argc, char **argv)
{
    // how many values to display on each axis. Limited by console resolution!
    const int c_numValues = 4;
  
    printf("Cubic Hermite rectangle:n");
    for (int i = 0; i < c_numValues; ++i)
    {
        float iPercent = ((float)i) / ((float)(c_numValues - 1));
        for (int j = 0; j < c_numValues; ++j)
        {
            if (j == 0)
                printf("  ");
            float jPercent = ((float)j) / ((float)(c_numValues - 1));
            float valueX = BicubicHermitePatch(c_ControlPointsX, jPercent, iPercent);
            float valueY = BicubicHermitePatch(c_ControlPointsY, jPercent, iPercent);
            float valueZ = BicubicHermitePatch(c_ControlPointsZ, jPercent, iPercent);
            printf("(%0.2f, %0.2f, %0.2f) ", valueX, valueY, valueZ);
        }
        printf("n");
    }
    printf("n");
  
    WaitForEnter();
    return 0;
}

And here’s the output. Note that the four corners of the output correspond to the four innermost points defined in the data!
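
As a quick sanity check of that claim: plugging u = 0 into BicubicHermitePatch makes each CubicHermite(p[i], 0) return p[i][1], so uValues holds the second column of control points, and CubicHermite(uValues, 0) then returns uValues[1]. The (0,0) corner of the patch is therefore exactly the inner control point p[1][1], and the same reasoning applies at the other three corners.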

On The GPU / Links

While cubic Hermite rectangles pass through all of their data points like Lagrange surfaces do (and like Bezier rectangles don’t), they don’t suffer from Runge’s phenomenon like Lagrange surfaces do.

However, just like Lagrange surfaces, Hermite surfaces don’t have the nice property that Bezier surfaces have, where the surface is guaranteed to stay inside of the convex hull defined by the control points.

Since Hermite surfaces are just cubic functions though, you could use a bit of calculus to find the minimum and maximum values they can reach and build a bounding box that way (a sketch of the idea is below). The same is technically true of Lagrange surfaces as well, for what it’s worth.
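
To make that concrete, here's a minimal sketch of the 1D version (the function name and structure are my own, not from any library): the derivative of f(t) = a*t^3 + b*t^2 + c*t + d is the quadratic 3a*t^2 + 2b*t + c, so the extremes of f on [0,1] can only occur at the endpoints or at real roots of that quadratic inside the interval.

#include <math.h>
#include <algorithm>

// Exact min/max of f(t) = a*t^3 + b*t^2 + c*t + d for t in [0, 1].
// The only candidates are the two endpoints and any real roots of
// f'(t) = 3a*t^2 + 2b*t + c that fall inside the interval.
void CubicRange01 (float a, float b, float c, float d, float& outMin, float& outMax)
{
    auto eval = [&] (float t) { return ((a*t + b)*t + c)*t + d; };

    outMin = std::min(eval(0.0f), eval(1.0f));
    outMax = std::max(eval(0.0f), eval(1.0f));

    // solve the quadratic 3a*t^2 + 2b*t + c = 0, handling the degenerate
    // (nearly linear) case where the quadratic coefficient vanishes
    float qa = 3.0f * a;
    float qb = 2.0f * b;
    float qc = c;
    float roots[2];
    int numRoots = 0;
    if (fabsf(qa) < 1e-6f)
    {
        if (fabsf(qb) > 1e-6f)
            roots[numRoots++] = -qc / qb;
    }
    else
    {
        float discriminant = qb*qb - 4.0f*qa*qc;
        if (discriminant >= 0.0f)
        {
            float root = sqrtf(discriminant);
            roots[numRoots++] = (-qb - root) / (2.0f * qa);
            roots[numRoots++] = (-qb + root) / (2.0f * qa);
        }
    }

    for (int i = 0; i < numRoots; ++i)
    {
        if (roots[i] > 0.0f && roots[i] < 1.0f)
        {
            outMin = std::min(outMin, eval(roots[i]));
            outMax = std::max(outMax, eval(roots[i]));
        }
    }
}

Extending this to a full bicubic patch takes more bookkeeping since the surface is cubic in two variables, but the principle is the same.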

Check out the links below to see cubic Hermite rectangles rendered in real time in WebGL using raytracing and raymarching:
Shadertoy: Cubic Hermite Rectangle
Shadertoy: Infinite Hermite Rectangles

Cubic Hermite Interpolation

It’s a big wide world of curves out there and I have to say that most of the time, I consider myself a Bezier man.

Well let me tell you… cubic Hermite splines are technically representable in Bezier form, but they have some really awesome properties that I never fully appreciated until recently.

Usefulness For Interpolation

If you have a set of data points on some fixed interval (like audio samples, but it could be anything), you can use a cubic Hermite spline to interpolate between any two data points. The spline passes through both end points, and it also gives each point a derivative that is the same whether you approach the point from the left or from the right.

In short, this means you can use cubic Hermite splines to interpolate data such that the result has C1 continuity everywhere!

Usefulness As Curves

If you have any number of control points N on a fixed interval, you can treat them as a series of piecewise cubic Hermite splines and evaluate the curve that way.

The end result is that you have a curve that is C1 continuous everywhere, it has local control (moving any control point only affects the two curve sections to the left and the two curve sections to the right), and best of all, the computational complexity doesn’t rise as you increase the number of control points!

The image below was taken as a screenshot from one of the HTML5 demos I made for you to play with. You can find links to them at the end of this post.

Cubic Hermite Splines

Cubic Hermite splines have four control points, but how they use those control points is a bit different than you might expect.

The curve itself passes only through the middle two control points, and the end control points are there to help calculate the tangent at the middle control points.

Let’s say you have control points P_{-1}, P_0, P_1, P_2. The curve at time 0 will be at point P_0 and the slope will be the same slope as a line would have if going from P_{-1} to P_1. The curve at time 1 will be at point P_1 and the slope will be the same slope as a line would have if going from P_0 to P_2.
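
In fact you can verify this with the formula from the math section below: since the curve is f(t) = a*t^3 + b*t^2 + c*t + d, the derivative is f'(t) = 3a*t^2 + 2b*t + c. Plugging in the coefficient definitions gives f'(0) = c = \frac{P_1 - P_{-1}}{2} and f'(1) = 3a + 2b + c = \frac{P_2 - P_0}{2}. Neighboring data points are one unit of t apart, so those are exactly the slopes of the lines from P_{-1} to P_1 and from P_0 to P_2.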

Check out the picture below to see what I mean visually.

That sounds like a strange set of properties, but they are actually super useful.

What this means is that you can treat any group of 4 control points / data points as a separate cubic hermite spline, but when you put it all together, it is a single smooth curve.

Note that you can either interpolate 1d data, or you can interpolate 2d data points by doing this interpolation on each axis. You could also use this to make a surface, which will likely be the next blog post!

The Math

I won’t go into how the formula is derived, but if you are interested you should check out Signal Processing: Bicubic Interpolation.

The formula is:

a*t^3+b*t^2+c*t+d

Where…

a = \frac{-P_{-1} + 3*P_0 - 3*P_1 + P_2}{2}
b = P_{-1} - \frac{5*P_0}{2} + 2*P_1 - \frac{P_2}{2}
c = \frac{-P_{-1} + P_1}{2}
d = P_0

Note that t is a value that goes from 0 to 1. When t is 0, your curve will be at P_0, and when t is 1, your curve will be at P_1. P_{-1} and P_2 are used to make this interpolation C1 continuous.

Here it is in some simple C++:

// t is a value that goes from 0 to 1 to interpolate in a C1 continuous way across uniformly sampled data points.
// when t is 0, this will return B.  When t is 1, this will return C.
static float CubicHermite (float A, float B, float C, float D, float t)
{
    float a = -A/2.0f + (3.0f*B)/2.0f - (3.0f*C)/2.0f + D/2.0f;
    float b = A - (5.0f*B)/2.0f + 2.0f*C - D / 2.0f;
    float c = -A/2.0f + C/2.0f;
    float d = B;

    return a*t*t*t + b*t*t + c*t + d;
}

Code

Here is an example C++ program that interpolates both 1D and 2D data.

#include <stdio.h>
#include <vector>
#include <array>
 
typedef std::vector<float> TPointList1D;
typedef std::vector<std::array<float,2>> TPointList2D;
 
void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}

// t is a value that goes from 0 to 1 to interpolate in a C1 continuous way across uniformly sampled data points.
// when t is 0, this will return B.  When t is 1, this will return C.
float CubicHermite (float A, float B, float C, float D, float t)
{
    float a = -A/2.0f + (3.0f*B)/2.0f - (3.0f*C)/2.0f + D/2.0f;
    float b = A - (5.0f*B)/2.0f + 2.0f*C - D / 2.0f;
    float c = -A/2.0f + C/2.0f;
    float d = B;
 
    return a*t*t*t + b*t*t + c*t + d;
}

template <typename T>
inline T GetIndexClamped(const std::vector<T>& points, int index)
{
    if (index < 0)
        return points[0];
    else if (index >= int(points.size()))
        return points.back();
    else
        return points[index];
}

int main (int argc, char **argv)
{
    const int c_numSamples = 13;

    // show some 1d interpolated values
    {
        const TPointList1D points =
        {
            0.0f,
            1.6f,
            2.3f,
            3.5f,
            4.3f,
            5.9f,
            6.8f
        };

        printf("1d interpolated values.  y = f(t)n");
        for (int i = 0; i < c_numSamples; ++i)
        {
            float percent = ((float)i) / (float(c_numSamples - 1));
            float x = (points.size()-1) * percent;

            int index = int(x);
            float t = x - floor(x);
            float A = GetIndexClamped(points, index - 1);
            float B = GetIndexClamped(points, index + 0);
            float C = GetIndexClamped(points, index + 1);
            float D = GetIndexClamped(points, index + 2);

            float y = CubicHermite(A, B, C, D, t);
            printf("  Value at %0.2f = %0.2fn", x, y);
        }
        printf("n");
    }

    // show some 2d interpolated values
    {
        const TPointList2D points =
        {
            { 0.0f, 1.1f },
            { 1.6f, 8.3f },
            { 2.3f, 6.5f },
            { 3.5f, 4.7f },
            { 4.3f, 3.1f },
            { 5.9f, 7.5f },
            { 6.8f, 0.0f }
        };

        printf("2d interpolated values.  x = f(t), y = f(t)n");
        for (int i = 0; i < c_numSamples; ++i)
        {
            float percent = ((float)i) / (float(c_numSamples - 1));
            float x = 0.0f;
            float y = 0.0f;

            float tx = (points.size() -1) * percent;
            int index = int(tx);
            float t = tx - floor(tx);

            std::array<float, 2> A = GetIndexClamped(points, index - 1);
            std::array<float, 2> B = GetIndexClamped(points, index + 0);
            std::array<float, 2> C = GetIndexClamped(points, index + 1);
            std::array<float, 2> D = GetIndexClamped(points, index + 2);
            x = CubicHermite(A[0], B[0], C[0], D[0], t);
            y = CubicHermite(A[1], B[1], C[1], D[1], t);

            printf("  Value at %0.2f = (%0.2f, %0.2f)n", tx, x, y);
        }
        printf("n");
    }
 
    WaitForEnter();
    return 0;
}

The output of the program is below:

Links

Here are some interactive HTML5 demos I made:
1D cubic hermite interpolation
2D cubic hermite interpolation

More info here:
Wikipedia: Cubic Hermite Spline

Closely related to cubic Hermite splines, Catmull-Rom splines (and the more general cardinal splines) allow you to specify a “tension” parameter to make the result more or less curvy:
Catmull-Rom spline
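
For context, here's a quick C++ sketch of that idea (my own illustration, not code from the linked page; strictly speaking, tension is a property of the broader cardinal spline family, of which Catmull-Rom is the tension 0 member). The tangents at the segment end points get scaled by (1 - tension) and fed into the standard Hermite basis functions.

// Standard Hermite basis: P0 and P1 are the end points of the segment,
// m0 and m1 the tangents at those points.
float HermiteSegment (float P0, float P1, float m0, float m1, float t)
{
    float t2 = t * t;
    float t3 = t2 * t;
    return ( 2.0f*t3 - 3.0f*t2 + 1.0f) * P0
         + (      t3 - 2.0f*t2 + t   ) * m0
         + (-2.0f*t3 + 3.0f*t2       ) * P1
         + (      t3 -      t2       ) * m1;
}

// Cardinal spline across uniformly spaced points A, B, C, D, interpolating
// from B to C.  tension = 0 gives Catmull-Rom, which matches the
// CubicHermite function from earlier in this post exactly.
float CardinalSpline (float A, float B, float C, float D, float t, float tension)
{
    float m0 = (1.0f - tension) * (C - A) / 2.0f;
    float m1 = (1.0f - tension) * (D - B) / 2.0f;
    return HermiteSegment(B, C, m0, m1, t);
}

With tension = 0 the tangents become (C-A)/2 and (D-B)/2, reproducing the CubicHermite results from earlier; cranking tension toward 1 shrinks the tangents and makes the curve straighter between points.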