# Low Tech Homomorphic Encryption

Homomorphic encryption is a special type of encryption that lets you do calculations on encrypted values as if they weren’t encrypted. One reason it’s desirable is that it would allow secure computing in the cloud, if practical homomorphic encryption were available.

Homomorphic encryption has been a hot research topic since 2009, when Craig Gentry figured out a way to do it while working on his PhD. Since then, people have been working on making it better, faster and more efficient.

You can read more about a basic incarnation of his ideas in my blog posts:
Super Simple Symmetric Leveled Homomorphic Encryption Implementation
Improving the Security of the Super Simple Symmetric Leveled Homomorphic Encryption Implementation

This post is about a low tech type of homomorphic encryption that anyone can easily do and understand. There is also some very simple C++ that implements it.

This idea may very well be publicly known, but I’m not aware of any places that talk about it. I may just be ignorant of them though so ::shrug::

## Quick Update

I’ve gotten some feedback on this article, the most common being that this is obfuscation, not encryption. I think that’s a fair assessment, as the secret value you are trying to protect is in no way transformed, but is just hidden. This post could easily be titled Homomorphic Obfuscation, and perhaps should be.

To see other feedback and responses to this post, check out the reddit links at the bottom!

## The Idea

The idea is actually super simple:

1. Take the value you want to encrypt.
2. Hide it in a list of a bunch of other random values, and remember where it is in the list. The position in the list is your key.
3. Send this list to an untrusted party.
4. They do the same calculation on every item in the list and send it back.
5. Since you know which value was your secret value, you know which answer is the one you care about.

At the end of that process, you have the resulting value, and they have no idea what value was your secret value. You have done, by definition, homomorphic encryption!

There is a caveat of course… they know that your secret value was ONE of the values on the list.
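The five steps above can be sketched in a few lines of C++ (a minimal sketch for illustration; the function names are made up, and a real version would want a cryptographically secure random source):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Minimal sketch of the scheme: hide the secret at a random index in a list
// of decoys, let the untrusted party transform every element, then read the
// answer back out at the remembered index (the "key").
std::vector<float> MakeEncryptedList(float secret, size_t listSize, size_t& key)
{
    std::mt19937 rng{ std::random_device{}() };
    std::uniform_real_distribution<float> dist(0.0f, 1000.0f);
    std::vector<float> list(listSize);
    for (float& v : list)
        v = dist(rng);
    key = std::uniform_int_distribution<size_t>(0, listSize - 1)(rng);
    list[key] = secret;
    return list;
}

// The untrusted party applies the same function to every element of the list.
template <typename FN>
void TransformAll(std::vector<float>& list, FN fn)
{
    for (float& v : list)
        v = fn(v);
}
```

Decryption is then just reading `list[key]` out of the transformed list when it comes back.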

## Security Details

The thing here is that there is a sliding scale between resource usage (computation time, RAM, network bandwidth, etc) and security.

The list size is your security parameter in this case.

A larger list of random values takes longer to transfer, takes more memory to store, and takes longer to do the homomorphic computations on, but the untrusted party is less sure about which value is your secret value.

On the other hand, a shorter list is faster to transmit, easier to store and quicker to compute with, but the untrusted party has a better idea of what your secret value is.

For maximal security you can just take this to the extreme – if your secret value is a 32 bit floating point number, you could make a list with all possible 2^32 floating point numbers in it, have them do the calculation and send it back. You can even do an optimization here and not even generate or send the list, but rather just have the person doing the calculations generate the full 2^32 float list, do the calculations, and send you the results.

That gets pretty big pretty fast though. That list would actually be 16 gigabytes, but the untrusted party would know almost nothing about your value, other than it can be represented by a 32 bit floating point number.
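That list-free optimization amounts to the computing side enumerating every 32 bit pattern and reinterpreting it as a float. A sketch of the reinterpretation (the function name is made up):

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a 32 bit pattern as a float. Looping i over all 2^32 values of
// uint32_t and calling this enumerates every possible float (including NaNs
// and infinities), so no list ever needs to be generated or transmitted.
float FloatFromBits(uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```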

Depending on your security needs, you might be OK with shortening your list to bring that number down. If your list is only one million numbers long (999,999 random numbers plus the one value you actually care about), it’s only 3.8 megabytes.

## Some Interesting Abilities

Using this homomorphic encryption, like other homomorphic encryption, you can do computations involving multiple encrypted values. For example, you could multiply two encrypted values together. To do this, you need to encrypt all the values involved using the same key, meaning they have to be at the same index in each of their respective lists of random numbers.

Something else that is interesting is that you can also encode MULTIPLE secret values in your encrypted value list. You could have 1 secret value at index 50 and another at index 100 for instance. Doing this, you get a sort of homomorphic SIMD setup.

Homomorphic SIMD is actually a real thing in other homomorphic encryption methods as well. Check out this paper for instance:
Fully Homomorphic SIMD Operations

The only problem with homomorphic SIMD is that adding more secret values to the same encrypted list decreases the security, since there are more values in the list that you don’t want other people to know about.

You can of course also modify encrypted values by unencrypted values. You could multiply an encrypted value by 3, by multiplying every value in the list by 3.

## Extending to Public Key Cryptography

If you wanted to use asymmetric key cryptography (public/private keys) instead of symmetric key cryptography, that is doable here too.

What you would do is have the public key public as per usual, and that key would be used in a public key algorithm to encrypt the index of the secret value in the random list.

Doing this, the person who has the private key would be able to receive the list and encrypted index, decrypt the index, and then get the secret value out using that index.

## Sample Code Tests

The sample code only does Symmetric key encryption, and does these 3 tests:

1. Encrypts two floating point numbers into a single list, SIMD style, does an operation on the encrypted values, then unencrypts and verifies the results.
2. Does the same with two sets of floats (three floats in each set), to show how you can make encrypted values interact with each other. Does the operation, then unencrypts and verifies the results.
3. Encrypts three values of a 3 byte structure, does an operation on the encrypted values, then unencrypts and verifies the results.

All secret data was hidden in lists of 10,000,000 random values. That made the first two tests (the ones done with 4 byte floats) have encrypted files of 38.1MB (40,000,000 bytes), and the last test (the one done with a 3 byte struct) had a file size of 28.6 MB (30,000,000 bytes).


## Sample Code

```cpp
/*

Written by Alan Wolfe
http://blog.demofox.org
Tweets by Atrix256

*/

#pragma once
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <limits>
#include <random>
#include <vector>

// A static class with template functions in it.
// A namespace would be nice, except I want to hide some things as private.
class LTHE
{
public:

//=================================================================================
template <typename T>
static bool Encrypt (std::vector<T> values, size_t listSize, const char* fileName, std::vector<size_t>& keys, bool generateKeys = true)
{
// Make sure we have a list that is at least as long as the values we want to encrypt
if (values.size() > listSize)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): values.size() > listSize.\n");
return false;
}

// Generate a list of keys if we are told to
// Ideally you want to take the first M items of a cryptographically secure shuffle
// of N items.
// This could be done with format preserving encryption or some other method
// to make it not roll and check, and also more secure random.
if (generateKeys)
{
keys.clear();
for (size_t i = 0, c = values.size(); i < c; ++i)
{
size_t newKey;
do
{
newKey = RandomInt<size_t>(0, listSize - 1);
}
while (std::find(keys.begin(), keys.end(), newKey) != keys.end());
keys.push_back(newKey);
}
}

// make a file of random values, size of T, count of <listSize>
FILE *file = fopen(fileName, "w+b");
if (!file)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.\n", fileName);
return false;
}

// Note: this may not be the most efficient way to generate this much random data or
// write it all to the file.
// In a real crypto usage case, you'd want a crypto secure random number generator.
// You'd also want to make sure the random numbers had the same properties as your
// input values to help anonymize them better.
// Like if your numbers are not whole numbers, you don't want to generate only whole numbers.
// Or if your numbers are salaries, you may not want purely random values, but more "salaryish"
// looking numbers.
// You could alternately just do all 2^N possible values which would definitely anonymize
// the values you wanted to encrypt.  This is maximum security, but also takes most
// memory and most processing time.
size_t numUint32s = (listSize * sizeof(T)) / sizeof(uint32_t);
size_t numExtraBytes = (listSize * sizeof(T)) % sizeof(uint32_t);
for (size_t i = 0; i < numUint32s; ++i)
{
uint32_t value = RandomInt<uint32_t>();
if (fwrite(&value, sizeof(value), 1, file) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write random numbers (uint32s).\n");
fclose(file);
return false;
}
}
for (size_t i = 0; i < numExtraBytes; ++i)
{
uint8_t value = RandomInt<uint8_t>();
if (fwrite(&value, sizeof(value), 1, file) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write random numbers (extra bytes).\n");
fclose(file);
return false;
}
}

// Now put the values in the file where they go, based on their key
for (size_t i = 0, c = values.size(); i < c; ++i)
{
long pos = (long)(keys[i] * sizeof(T));
if (fseek(file, pos, SEEK_SET) != 0)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not fseek.\n");
fclose(file);
return false;
}
if (fwrite(&values[i], sizeof(values[i]), 1, file) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write secret value.\n");
fclose(file);
return false;
}
}

// close file and return success
fclose(file);
return true;
}

//=================================================================================
template <typename T, typename LAMBDA>
static bool TransformHomomorphically (const char* srcFileName, const char* destFileName, const LAMBDA& function)
{
// open the source and dest file if we can
FILE *srcFile = fopen(srcFileName, "rb");
if (!srcFile)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.\n", srcFileName);
return false;
}
FILE *destFile = fopen(destFileName, "w+b");
if (!destFile)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.\n", destFileName);
fclose(srcFile);
return false;
}

// Process the data in the file and write it back out.
// This could be done much better.
// We could read more from the file at once.
// We could use SIMD.
// We could do this on the GPU for large data sets and longer transformations! Assuming data transfer time isn't too prohibitive.
// We could decouple the disk access from processing, so it was reading and writing while it was processing.
const size_t c_bufferSize = 1024;
std::vector<T> dataBuffer;
dataBuffer.resize(c_bufferSize);
do
{
// read data from the source file
size_t elementsRead = fread(&dataBuffer[0], sizeof(T), c_bufferSize, srcFile);

// transform the data
for (size_t i = 0; i < elementsRead; ++i)
dataBuffer[i] = function(dataBuffer[i]);

// write the transformed data to the dest file
if (elementsRead > 0 && fwrite(&dataBuffer[0], sizeof(T) * elementsRead, 1, destFile) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write transformed elements.\n");
fclose(srcFile);
fclose(destFile);
return false;
}
}
while (!feof(srcFile));

// close files and return success
fclose(srcFile);
fclose(destFile);
return true;
}

//=================================================================================
template <typename T, typename LAMBDA>
static bool TransformHomomorphically (const char* src1FileName, const char* src2FileName, const char* destFileName, const LAMBDA& function)
{
// open the source and dest file if we can
FILE *srcFile1 = fopen(src1FileName, "rb");
if (!srcFile1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.\n", src1FileName);
return false;
}
FILE *srcFile2 = fopen(src2FileName, "rb");
if (!srcFile2)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.\n", src2FileName);
fclose(srcFile1);
return false;
}
FILE *destFile = fopen(destFileName, "w+b");
if (!destFile)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.\n", destFileName);
fclose(srcFile1);
fclose(srcFile2);
return false;
}

// Process the data in the file and write it back out.
// This could be done much better.
// We could read more from the file at once.
// We could use SIMD.
// We could do this on the GPU for large data sets and longer transformations! Assuming data transfer time isn't too prohibitive.
// We could decouple the disk access from processing, so it was reading and writing while it was processing.
const size_t c_bufferSize = 1024;
std::vector<T> dataBuffer1, dataBuffer2;
dataBuffer1.resize(c_bufferSize);
dataBuffer2.resize(c_bufferSize);
do
{
// read data from the source files
size_t elementsRead1 = fread(&dataBuffer1[0], sizeof(T), c_bufferSize, srcFile1);
size_t elementsRead2 = fread(&dataBuffer2[0], sizeof(T), c_bufferSize, srcFile2);

// make sure the files had the same number of elements
if (elementsRead1 != elementsRead2)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Different numbers of elements in each file!\n");
fclose(srcFile1);
fclose(srcFile2);
fclose(destFile);
return false;
}

// transform the data
for (size_t i = 0; i < elementsRead1; ++i)
dataBuffer1[i] = function(dataBuffer1[i], dataBuffer2[i]);

// write the transformed data to the dest file
if (elementsRead1 > 0 && fwrite(&dataBuffer1[0], sizeof(T) * elementsRead1, 1, destFile) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write transformed elements.\n");
fclose(srcFile1);
fclose(srcFile2);
fclose(destFile);
return false;
}
}
while (!feof(srcFile1));

// close files and return success
fclose(srcFile1);
fclose(srcFile2);
fclose(destFile);
return true;
}

//=================================================================================
template <typename T>
static bool Decrypt (const char* fileName, std::vector<T>& values, std::vector<size_t>& keys)
{
// Open the file if we can
FILE *file = fopen(fileName, "rb");
if (!file)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.\n", fileName);
return false;
}

// Read the values from the file.  The key is their location in the file.
values.clear();
for (size_t i = 0, c = keys.size(); i < c; ++i)
{
long pos = (long)(keys[i] * sizeof(T));
if (fseek(file, pos, SEEK_SET) != 0)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not fseek.\n");
fclose(file);
return false;
}
T value;
if (fread(&value, sizeof(value), 1, file) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not decrypt value for key.\n");
fclose(file);
return false;
}
values.push_back(value);
}

// Close file and return success
fclose(file);
return true;
}

private:
template <typename T>
static T RandomInt (T min = std::numeric_limits<T>::min(), T max = std::numeric_limits<T>::max())
{
static std::random_device rd;
static std::mt19937 mt(rd());
std::uniform_int_distribution<T> dist(min, max);
return dist(mt);
}
};
```

And here is the test program, main.cpp:

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>
#include <chrono>
#include <vector>
#include "LTHE.h"

//=================================================================================
// times a block of code
struct SBlockTimer
{
SBlockTimer()
{
m_start = std::chrono::high_resolution_clock::now();
}

~SBlockTimer()
{
std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
printf("    %0.2f seconds\n", seconds.count());
}

std::chrono::high_resolution_clock::time_point m_start;
};

//=================================================================================
float TransformDataUnitary (float& value)
{
return (float)sqrt(value * 2.17f + 0.132);
}

//=================================================================================
float TransformDataBinary (float& value1, float value2)
{
return (float)sqrt(value1 * value1 + value2 * value2);
}

//=================================================================================
struct SStruct
{
uint8_t x, y, z;

static SStruct Transform (const SStruct& b)
{
SStruct ret;
ret.x = b.x * 2;
ret.y = b.y * 3;
ret.z = b.z * 4;
return ret;
}

bool operator != (const SStruct& b) const
{
return b.x != x || b.y != y || b.z != z;
}
};

//=================================================================================
int Test_FloatUnitaryOperation ()
{
printf("\n----- " __FUNCTION__ " -----\n");

// Encrypt the data
printf("Encrypting data:  ");
std::vector<float> secretValues = { 3.14159265359f, 435.0f };
std::vector<size_t> keys;
{
SBlockTimer timer;
if (!LTHE::Encrypt(secretValues, 10000000, "Encrypted.dat", keys))
{
fprintf(stderr, "Could not encrypt data.\n");
return -1;
}
}

// Transform the data
printf("Transforming data:");
{
SBlockTimer timer;
if (!LTHE::TransformHomomorphically<float>("Encrypted.dat", "Transformed.dat", TransformDataUnitary))
{
fprintf(stderr, "Could not transform encrypted data.\n");
return -2;
}
}

// Decrypt the data
printf("Decrypting data:  ");
std::vector<float> decryptedValues;
{
SBlockTimer timer;
if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
{
fprintf(stderr, "Could not decrypt data.\n");
return -3;
}
}

// Verify the data
printf("Verifying data:   ");
{
SBlockTimer timer;
for (size_t i = 0, c = secretValues.size(); i < c; ++i)
{
if (TransformDataUnitary(secretValues[i]) != decryptedValues[i])
{
fprintf(stderr, "decrypted value mismatch!\n");
return -4;
}
}
}

return 0;
}

//=================================================================================
int Test_FloatBinaryOperation ()
{
printf("\n----- " __FUNCTION__ " -----\n");

// Encrypt the data
printf("Encrypting data:  ");
std::vector<float> secretValues1 = { 3.14159265359f, 435.0f, 1.0f };
std::vector<float> secretValues2 = { 1.0f, 5.0f, 9.0f };
std::vector<size_t> keys;
{
SBlockTimer timer;
if (!LTHE::Encrypt(secretValues1, 10000000, "Encrypted1.dat", keys))
{
fprintf(stderr, "Could not encrypt data.\n");
return -1;
}
if (!LTHE::Encrypt(secretValues2, 10000000, "Encrypted2.dat", keys, false)) // reuse the keys made for secretValues1
{
fprintf(stderr, "Could not encrypt data.\n");
return -1;
}
}

// Transform the data
printf("Transforming data:");
{
SBlockTimer timer;
if (!LTHE::TransformHomomorphically<float>("Encrypted1.dat", "Encrypted2.dat", "Transformed.dat", TransformDataBinary))
{
fprintf(stderr, "Could not transform encrypted data.\n");
return -2;
}
}

// Decrypt the data
printf("Decrypting data:  ");
std::vector<float> decryptedValues;
{
SBlockTimer timer;
if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
{
fprintf(stderr, "Could not decrypt data.\n");
return -3;
}
}

// Verify the data
printf("Verifying data:   ");
{
SBlockTimer timer;
for (size_t i = 0, c = secretValues1.size(); i < c; ++i)
{
if (TransformDataBinary(secretValues1[i], secretValues2[i]) != decryptedValues[i])
{
fprintf(stderr, "decrypted value mismatch!\n");
return -4;
}
}
}

return 0;
}

//=================================================================================
int Test_StructUnitaryOperation ()
{
printf("\n----- " __FUNCTION__ " -----\n");

// Encrypt the data
printf("Encrypting data:  ");
std::vector<SStruct> secretValues = { {0,1,2},{ 3,4,5 },{ 6,7,8 } };
std::vector<size_t> keys;
{
SBlockTimer timer;
if (!LTHE::Encrypt(secretValues, 10000000, "Encrypted.dat", keys))
{
fprintf(stderr, "Could not encrypt data.\n");
return -1;
}
}

// Transform the data
printf("Transforming data:");
{
SBlockTimer timer;
if (!LTHE::TransformHomomorphically<SStruct>("Encrypted.dat", "Transformed.dat", SStruct::Transform))
{
fprintf(stderr, "Could not transform encrypted data.\n");
return -2;
}
}

// Decrypt the data
printf("Decrypting data:  ");
std::vector<SStruct> decryptedValues;
{
SBlockTimer timer;
if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
{
fprintf(stderr, "Could not decrypt data.\n");
return -3;
}
}

// Verify the data
printf("Verifying data:   ");
{
SBlockTimer timer;
for (size_t i = 0, c = secretValues.size(); i < c; ++i)
{
if (SStruct::Transform(secretValues[i]) != decryptedValues[i])
{
fprintf(stderr, "decrypted value mismatch!\n");
return -4;
}
}
}

return 0;
}

//=================================================================================
int main (int argc, char **argv)
{
// test doing an operation on a single encrypted float
int ret = Test_FloatUnitaryOperation();
if (ret != 0)
{
system("pause");
return ret;
}

// test doing an operation on two encrypted floats
ret = Test_FloatBinaryOperation();
if (ret != 0)
{
system("pause");
return ret;
}

// test doing an operation on a single 3 byte struct
ret = Test_StructUnitaryOperation();
if (ret != 0)
{
system("pause");
return ret;
}

printf("\nAll Tests Passed!\n\n");
system("pause");
return 0;
}
```

If you found this post interesting or useful, or you have anything to add or talk about, let me know!

Reddit discussion:
r/programming
r/cryptography

# A Data Point for MSVC vs Clang Code Generation

I’m a windows PC game developer using MSVC 2015 update 1, working in C++.

More and more, I hear people talk about how great clang is, and that it generates much better code than MSVC, among other niceties.

I have been chalking it up to maybe just being a fad, but keeping my feelers out there to see if I can get some concrete comparative info.

Well, I’ve stumbled on some of that concrete comparative info and want to share it with anyone else who is wondering if clang really is better than MSVC. This is just one data point, but it feels like it’s just the tip of the iceberg.

I wasn’t really sure what he (Jason Turner, whose examples these are) was getting at, so I clicked the link to check it out (http://godbolt.org/g/WDpPYq).

It turned out to be very relevant to my interest, because his particular example is comparing a value to a bunch of compile time constants. That is basically the core of what I’ve been looking into with my last few posts asking whether code or data was faster!

This first particular example is comparing a compile time constant to other compile time constants, so the code completely melts away and at runtime just returns the compile time calculated result. That isn’t very interesting, but it is nice to see that clang did so much at compile time. FWIW MSVC was able to do this all at compile time as well, so they are even so far.

What is more interesting is what happens when you test a compile time unknown against compile time constants. Let’s check it out… (https://godbolt.org/g/dKBDSK)

What the assembly does is subtract 1 from the input (it’s unsigned, so 0 wraps around to the max value), and then compare the result against 4 to know if it’s in the group or not. Clang realized the numbers were consecutive and so made a nice optimization.
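The trick is easy to reproduce by hand. A sketch of the membership test clang derives from the five equality comparisons:

```cpp
#include <cstdint>

// Tests whether x is one of 1,2,3,4,5 with a single unsigned compare:
// x - 1 wraps around to a huge value when x is 0, so (x - 1) <= 4 is true
// exactly for x in [1, 5]. This is what the subtract-and-compare assembly does.
bool OneOfOneToFive(uint32_t x)
{
    return (x - 1u) <= 4u;
}
```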

In this case, MSVC did similar in release x64:

```
#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
bool match = false;
(void)std::initializer_list<bool>{ (match = match || u == t)... };
return match;
}

int main(int argc, char** argv)
{
00007FF62DB61000  lea         eax,[rcx-1]
00007FF62DB61003  cmp         eax,4
00007FF62DB61006  setbe       al
return one_of(argc, 1, 2, 3, 4, 5);
00007FF62DB61009  movzx       eax,al
}
00007FF62DB6100C  ret
```

But in x86 release it did a bunch of if/else if/else if’s!

```
#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
bool match = false;
(void)std::initializer_list<bool>{ (match = match || u == t)... };
return match;
}

int main(int argc, char** argv)
{
00331002  in          al,dx
return one_of(argc, 1, 2, 3, 4, 5);
00331003  mov         eax,dword ptr [argc]
00331006  cmp         eax,1
00331009  je          main+26h (0331026h)
0033100B  cmp         eax,2
0033100E  je          main+26h (0331026h)
00331010  cmp         eax,3
00331013  je          main+26h (0331026h)
00331015  cmp         eax,4
00331018  je          main+26h (0331026h)
0033101A  cmp         eax,5
0033101D  je          main+26h (0331026h)
0033101F  xor         al,al
00331021  movzx       eax,al
}
00331024  pop         ebp
00331025  ret
return one_of(argc, 1, 2, 3, 4, 5);
00331026  mov         al,1
00331028  movzx       eax,al
}
0033102B  pop         ebp
0033102C  ret
```

You are probably asking “what does clang do in x86?” It turns out it does the same thing as in x64; it doesn’t fall back to if/else if/else if like MSVC does (proof: add -m32 in godbolt: https://godbolt.org/g/khnrtO). One point to clang!

What if the numbers are not so continuous though? It turns out it can actually switch to using a binary search! (https://godbolt.org/g/iBkqja)
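What clang emits in that case is morally a binary search over the sorted constants. A hand-written sketch of the same idea, using the constants from the example below:

```cpp
#include <algorithm>
#include <iterator>

// Membership test via binary search over the sorted constant set, which is
// roughly what clang's generated code amounts to: O(log N) compares instead
// of a linear chain of N equality tests.
bool OneOf(int x)
{
    static const int c_values[] = { 5, 1039, 1471, 1534, 2453, 3615, 3817, 4635, 37819, 45361, 65534, 109378 };
    return std::binary_search(std::begin(c_values), std::end(c_values), x);
}
```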

MSVC on the other hand just does a bunch of if/else if/else if tests, in both x86 release and x64 release.

```
#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
bool match = false;
(void)std::initializer_list<bool>{ (match = match || u == t)... };
return match;
}

int main(const int argc, const char *[])
{
00007FF6C05A1000  cmp         ecx,1AB42h
00007FF6C05A1006  je          main+63h (07FF6C05A1063h)
00007FF6C05A1008  cmp         ecx,40Fh
00007FF6C05A100E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1010  cmp         ecx,0B131h
00007FF6C05A1016  je          main+63h (07FF6C05A1063h)
00007FF6C05A1018  cmp         ecx,93BBh
00007FF6C05A101E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1020  cmp         ecx,121Bh
00007FF6C05A1026  je          main+63h (07FF6C05A1063h)
00007FF6C05A1028  cmp         ecx,0EE9h
00007FF6C05A102E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1030  cmp         ecx,0E1Fh
00007FF6C05A1036  je          main+63h (07FF6C05A1063h)
00007FF6C05A1038  cmp         ecx,995h
00007FF6C05A103E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1040  cmp         ecx,5FEh
00007FF6C05A1046  je          main+63h (07FF6C05A1063h)
00007FF6C05A1048  cmp         ecx,5BFh
00007FF6C05A104E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1050  cmp         ecx,5
00007FF6C05A1053  je          main+63h (07FF6C05A1063h)
00007FF6C05A1055  cmp         ecx,0FFFEh
00007FF6C05A105B  je          main+63h (07FF6C05A1063h)
return one_of(argc, 1471, 2453, 3817, 45361, 37819, 109378, 1534, 4635, 1039, 3615, 5, 65534);
00007FF6C05A105D  xor         al,al
00007FF6C05A105F  movzx       eax,al
}
00007FF6C05A1062  ret
return one_of(argc, 1471, 2453, 3817, 45361, 37819, 109378, 1534, 4635, 1039, 3615, 5, 65534);
00007FF6C05A1063  mov         al,1
00007FF6C05A1065  movzx       eax,al
}
00007FF6C05A1068  ret
```

## Closing

This is just one data point about how clang is better than MSVC, but it is a data point. I’m betting there are more if we looked for them.

This makes me wonder how switch statements do in clang vs MSVC, and also makes me wonder if clang ever uses jump tables or more advanced data structures in either switch statements or other code that does comparisons against a potentially large number of compile time constants. Those thoughts are driven by things seen in this article: Something You May Not Know About the Switch Statement in C/C++

The examples above used C++14 level C++ to implement “one_of”. If you can use C++17 level C++, you can also implement it this way, which also does the binary search (Also written by Jason Turner):
https://godbolt.org/g/RZgjRQ

PS wouldn’t it be nice if godbolt supported MSVC so we could do this sort of analysis on MSVC code? It’s in the works, but unsure when it’ll be available. Apparently licensing isn’t the issue, so let’s hope it comes sooner rather than later! If you want it, maybe ping @mattgodbolt and let him know how much you want that functionality (:

Have any other clang vs MSVC info? If so, I’d love to hear about it!

# Is Code Faster Than Data? Examining Hash Tables

This series of posts is aimed at examining if and how ad hoc code crafted for a specific static (unchanging / constant) data set can run faster than typical generic run time data structures. I think the answer is an obvious “yes, we can often do better”, but these posts explore the details of the problem space and explore how and when we might do better.

The last post explored switch statement performance compared to array access performance.

A switch statement is just a way of telling the compiler how we want to map integral inputs to either some code to run, or some value to return. It’s up to the compiler how to make that happen.

Because compiler makers presumably want to make their compiler generate fast code, it seems like a switch statement should be able to match the speed of an array since a switch statement could very well be implemented as an array when all it is doing is returning different values based on input. Maybe it could even beat an array, in the case of a sparse array, an array with many duplicates, or other situations.

In practice, this doesn’t seem to be the case though, and switch statements are actually quite a bit slower than arrays from the experimentation I’ve done. The main part of the overhead seems to be that it always does a jump (goto) based on the input you are switching on. It can do some intelligent things to find the right location to jump to, but if all you are doing is returning a value, it doesn’t seem smart enough to do a memory read from an array and return that value, instead of doing a jump.
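To make the comparison concrete, here is a sketch of the two shapes being compared: a switch statement that only maps inputs to return values, and the array lookup it could in principle compile down to (the values are arbitrary):

```cpp
// A switch statement that does nothing but map inputs to values...
int LookupSwitch(int i)
{
    switch (i)
    {
        case 0: return 10;
        case 1: return 30;
        case 2: return 20;
        default: return 0;
    }
}

// ...versus the array read it could in principle be compiled into:
// one range check and one memory load, no jumps based on the input.
int LookupArray(int i)
{
    static const int c_values[] = { 10, 30, 20 };
    return (i >= 0 && i < 3) ? c_values[i] : 0;
}
```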

You can read a nice analysis on how switch statements are compiled on the microsoft compiler here: Something You May Not Know About the Switch Statement in C/C++.

Today we are going to be analyzing how hash tables fare against switch statements, arrays, and a few other things.

## Testing Details

I ran these tests in x86/x64 debug/release in visual studio 2015.

I got a list of 100 random words from http://www.randomwordgenerator.com/ and made sure they were all lowercase. I associated an integer value with them, from 1 to 100. My tests are all based on the string being the key and the integer being the value.

I have that data stored/represented in several ways for performing lookups:

1. std::map.
2. std::unordered_map.
3. std::unordered_map using crc32 hash function.
4. std::unordered_map using crc32 hash function modulo 337 and salted with 1147287 to prevent collisions.
5. SwitchValue() switches on crc32 of input string.
6. SwitchValueValidate() switches on crc32 of input string but does a single strcmp to handle possibly invalid input.
7. SwitchValueMinimized() switches on crc32 of input string modulo 337 and salted with 1147287 to prevent collisions.
8. SwitchValueMinimizedValidate() like SwitchValueMinimized() but does a single strcmp to handle possibly invalid input.
9. g_SwitchValueMinimizedArray, the array version of SwitchValueMinimized().
10. g_SwitchValueMinimizedArrayValidate, the array version of SwitchValueMinimizedValidate().
11. BruteForceByStartingLetter() switches on first letter, then brute force strcmp’s words beginning with that letter.
12. BruteForce() brute force strcmp’s all words.

The non validating switch statement functions have an __assume(0) in their default case to remove the overhead of testing for invalid values. This is to make them as fast as possible for the cases when you will only be passing valid values in. If ever that contract was broken, you’d hit undefined behavior, so the performance boost comes at a cost. The Validate versions of the switch functions don’t do this, as they are meant to take possibly invalid input in, and handle it gracefully. Both validating and not validating input are common use cases so I wanted to represent both in the performance analysis.
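As a sketch of that pattern (not the post’s actual code: the hash constants here are made up, and `__builtin_unreachable()` stands in for `__assume(0)` on non-MSVC compilers):

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#define UNREACHABLE() __assume(0)
#else
#define UNREACHABLE() __builtin_unreachable()
#endif

// Non-validating lookup: the default case promises the compiler that no other
// input can ever occur, letting it drop the "not found" handling entirely.
// The cost is undefined behavior if the promise is ever broken.
// (Hash values are illustrative, not real crc32s of anything.)
int SwitchValueSketch(uint32_t hash)
{
    switch (hash)
    {
        case 0x11111111: return 1;
        case 0x22222222: return 2;
        case 0x33333333: return 3;
        default: UNREACHABLE();
    }
}
```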

Here are the tests done:

1. In Order – looks up all strings in order and sums the associated values.
2. Shuffled – looks up all strings in random order and sums the associated values.
3. Pre-Hashed Keys In Order – looks up all strings in order and sums the associated values, using pre-hashed keys.
4. Pre-Hashed Keys Shuffled – looks up all strings in random order and sums the associated values, using pre-hashed keys.

The second two tests only apply to the value lookups which can take pre-hashed keys. For instance, g_SwitchValueMinimizedArray can be indexed by a key that was hashed before the program ran, but a std::unordered_map cannot be indexed by a hash value that was calculated in advance.
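Pre-hashing works because these lookups key off an integral hash of the string, and for string literals that hash can be computed before the program runs. A sketch using a constexpr FNV-1a hash as a stand-in for the post’s crc32:

```cpp
#include <cstdint>

// Compile time FNV-1a hash (a stand-in for the post's crc32): the hash of a
// string literal folds to a plain integer constant, so lookups keyed on it
// can skip the runtime hashing step entirely.
constexpr uint32_t Fnv1a(const char* s, uint32_t h = 2166136261u)
{
    return *s ? Fnv1a(s + 1, (h ^ (uint32_t)(unsigned char)*s) * 16777619u) : h;
}

static_assert(Fnv1a("salmon") == Fnv1a("salmon"), "same string, same key");
```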

Each of those tests was done 5,000 times in a row to make performance differences stand out more, and that full amount of time is the time reported. That process was done 50 times to give both an average (a mean) and a standard deviation to show how much the time samples differed.

The source code for the tests can be found here:
Github: Atrix256/RandomCode/HashVsSwitch

## Results

Here are the results, in milliseconds. The values in parentheses are the standard deviations, which are also in milliseconds.

In Order

Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.

| | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
|---|---|---|---|---|
| std::map | 7036.77 (126.41) | 7070.18 (155.49) | 33.02 (2.68) | 35.40 (1.43) |
| std::unordered_map | 4235.31 (24.41) | 4261.36 (145.16) | 19.97 (0.45) | 20.12 (0.62) |
| std::unordered_map crc32 | 4236.38 (80.72) | 4275.36 (116.65) | 24.36 (0.47) | 23.47 (0.86) |
| std::unordered_map crc32 minimized | 4034.50 (12.72) | 4323.67 (170.55) | 26.39 (0.50) | 23.68 (0.71) |
| SwitchValue() | 123.28 (0.98) | 144.29 (4.91) | 6.81 (0.30) | 5.47 (0.29) |
| SwitchValueValidate() | 127.59 (1.22) | 147.41 (5.20) | 8.84 (0.35) | 7.99 (0.36) |
| SwitchValueMinimized() | 128.83 (0.95) | 151.48 (4.66) | 8.28 (0.38) | 10.18 (0.37) |
| SwitchValueMinimizedValidate() | 132.44 (1.02) | 159.85 (6.73) | 12.65 (0.40) | 10.89 (0.36) |
| g_SwitchValueMinimizedArray | 104.15 (1.13) | 122.94 (5.98) | 7.68 (0.36) | 6.08 (0.36) |
| g_SwitchValueMinimizedArrayValidate | 107.75 (1.07) | 120.75 (2.80) | 10.49 (0.37) | 8.95 (0.32) |
| BruteForceByStartingLetter() | 19.92 (0.63) | 22.01 (0.86) | 4.85 (0.24) | 5.81 (0.26) |
| BruteForce() | 118.65 (1.09) | 140.20 (2.28) | 31.53 (0.56) | 46.47 (0.83) |

Shuffled

Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.

| | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
|---|---|---|---|---|
| std::map | 7082.92 (214.13) | 6999.90 (193.82) | 32.14 (0.59) | 34.20 (0.62) |
| std::unordered_map | 4155.85 (133.00) | 4221.84 (124.70) | 20.21 (0.42) | 20.09 (0.47) |
| std::unordered_map crc32 | 4286.44 (95.39) | 4300.81 (64.37) | 24.55 (0.57) | 23.06 (0.57) |
| std::unordered_map crc32 minimized | 4186.27 (75.35) | 4111.73 (43.36) | 26.36 (0.56) | 23.65 (0.54) |
| SwitchValue() | 127.93 (3.85) | 137.63 (1.31) | 6.97 (0.32) | 5.47 (0.27) |
| SwitchValueValidate() | 131.46 (2.34) | 141.38 (1.47) | 8.92 (0.38) | 7.86 (0.37) |
| SwitchValueMinimized() | 133.03 (2.93) | 145.74 (1.50) | 9.16 (0.37) | 10.50 (0.41) |
| SwitchValueMinimizedValidate() | 135.47 (2.27) | 151.58 (1.48) | 12.13 (0.40) | 10.13 (0.43) |
| g_SwitchValueMinimizedArray | 106.38 (2.70) | 118.61 (3.73) | 8.18 (0.31) | 5.19 (0.29) |
| g_SwitchValueMinimizedArrayValidate | 109.32 (2.34) | 120.94 (3.02) | 10.49 (0.55) | 9.00 (0.40) |
| BruteForceByStartingLetter() | 20.45 (0.92) | 21.64 (0.76) | 4.90 (0.31) | 5.87 (0.32) |
| BruteForce() | 120.70 (2.16) | 140.95 (1.71) | 32.50 (0.47) | 45.90 (0.79) |

Pre-hashed In Order

Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.

| | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
|---|---|---|---|---|
| SwitchValue() | 12.49 (0.61) | 13.23 (0.37) | 1.94 (0.17) | 1.81 (0.12) |
| SwitchValueValidate() | 17.08 (1.06) | 16.72 (0.57) | 4.32 (0.30) | 4.05 (0.21) |
| SwitchValueMinimized() | 11.83 (0.69) | 12.06 (0.51) | 1.29 (0.13) | 1.58 (0.17) |
| SwitchValueMinimizedValidate() | 16.02 (0.84) | 15.84 (0.66) | 3.25 (0.24) | 3.47 (0.27) |
| g_SwitchValueMinimizedArray | 1.23 (0.06) | 1.15 (0.10) | 0.00 (0.00) | 0.00 (0.00) |
| g_SwitchValueMinimizedArrayValidate | 4.21 (0.32) | 2.99 (0.20) | 2.45 (0.17) | 2.66 (0.20) |

Pre-hashed Shuffled

Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.

| | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
|---|---|---|---|---|
| SwitchValue() | 12.96 (1.37) | 13.45 (0.47) | 1.84 (0.11) | 1.81 (0.16) |
| SwitchValueValidate() | 16.27 (2.01) | 16.57 (0.63) | 2.65 (0.19) | 2.85 (0.17) |
| SwitchValueMinimized() | 11.75 (0.63) | 12.15 (0.45) | 1.07 (0.07) | 1.06 (0.11) |
| SwitchValueMinimizedValidate() | 16.44 (0.99) | 16.33 (0.58) | 3.43 (0.18) | 3.41 (0.22) |
| g_SwitchValueMinimizedArray | 1.13 (0.06) | 1.18 (0.10) | 0.32 (0.05) | 0.31 (0.04) |
| g_SwitchValueMinimizedArrayValidate | 4.50 (0.32) | 3.31 (0.18) | 2.82 (0.16) | 3.29 (0.18) |

## Observations

There’s a lot of data, but here’s the things I found most interesting or relevant to what I’m looking at (generic data structures vs ad hoc code for data).

Tests 1 and 2

std::map and std::unordered_map are very, very slow in debug, as you might expect. It would be interesting to look deeper and see what it is they are doing in debug that slows them down so much.

There is some tribal knowledge in the C++ world that says to not use std::map and to use std::unordered_map instead, but I was surprised to see just how slow std::map was. In x64 release, std::map took about 75% of the time that brute force did, and in win32 release, it took the same time or was slower! std::map isn't hash based; it's an ordered container (typically a red-black tree) that you give a less-than comparison function. Even so, you have to wonder how the heck the algorithm can be so slow that brute force is a comparable replacement for lookup times!

It’s interesting to see that everything I tried (except brute force) was significantly faster than both std::map and std::unordered_map. That saddens me a little bit, but to be fair, the usage case I’m going after is a static data structure that has fast lookup speeds, which isn’t what unordered_map aims to solve. This just goes to show that yes, if you have static data that you want fast lookup times for, making ad hoc code or rolling your own read only data structure can give you significant performance wins, and can also help with memory issues (fragmentation and wasted allocations that will never be used).

It was surprising to see that switching on the first letter and brute forcing the strings with the same first letter did so well. That is one of the faster results, competing with SwitchValue() for top dog. The interesting thing though is that BruteForceByStartingLetter() gracefully handles invalid input, while SwitchValue() does not and has undefined behavior, so another point goes to BruteForceByStartingLetter().
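A miniature sketch of that first-letter bucketing approach (the word list and values here are made up; the real function covers all 100 words):

```cpp
#include <cstring>

// Hypothetical mini version of BruteForceByStartingLetter(): switch on the
// first character, then only strcmp against words starting with that letter.
int LookupByStartingLetter(const char* key)
{
    switch (key[0])
    {
        case 'a':
            if (std::strcmp(key, "apple") == 0)  return 1;
            if (std::strcmp(key, "anchor") == 0) return 2;
            break;
        case 'b':
            if (std::strcmp(key, "banana") == 0) return 3;
            break;
    }
    return -1; // not found: invalid input is handled gracefully, no UB
}
```

Because every path ends in a real string compare, invalid input falls out the bottom naturally, which is the graceful handling mentioned above.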

Tests 3 and 4

These tests were done with pre-hashed keys to simulate an ideal setup.

If you have a static key to value data structure and have the ability to make ad hoc code for your specific static data, chances are pretty good that you’ll also be able to pre-hash whatever keys you are going to be looking up so you don’t have to hash them at run time. Also, if you are doing multiple lookups with a single key for some reason, you may opt to calculate the hash only on the first lookup, and then from there re-use the hashed key.

These tests simulated those situations.

As expected, the perf results on these tests are much better than those that hash the key on demand for each lookup. Less work done at runtime means better performance.

Based on the results of the last blog post – that array lookups are super fast – you probably aren’t surprised to see that g_SwitchValueMinimizedArray is the winner for performance by far.

It is so fast that the in order case doesn’t even register any time having been taken. This is probably a little misleading, because the in order tests (and even the shuffled tests) are very cache friendly. In reality, you would probably have more cache misses and it wouldn’t be quite as cheap as what is being reported, but it would still be super fast compared to the other options.

In second place comes SwitchValueMinimized() which is the switch statement function version of g_SwitchValueMinimizedArray. Arrays still beat switch statements, as we found out in the last post!

In third place comes SwitchValue(), which is the same as SwitchValueMinimized() but has sparser values used in the switch statement, which make it more difficult for the compiler to generate efficient code. For instance, having the full range of 32 bits as case statement values, and having them all be pseudo random numbers (because they are the result of a hash!) rules out the ability for the compiler to make a jump table array, or find any patterns in the numbers. The SwitchValueMinimized() function on the other hand has only 337 possible values, and so even though the values are sparse (there are 100 items in those 337 possible values), it’s a small enough number that a jump table array could be used without issues.

After that comes all the validated versions of the tests. It makes sense that they would be slower, because they do all the same work, and then some additional work (strcmp) to ensure that the input is valid.

## Getting The Fastest Results

If you have some static data that maps keys to values, and you need it to be fast for doing lookups, it looks like writing something custom is the way to go.

The absolute fastest way to do it is to make an array out of your data items and then pre-process (or compile time process) any places that do a lookup, converting keys to array indices. Then, at run time, you only need to do an array lookup to a known index to get your data, which is super fast. If your data has duplicates, you might also be able to make the keys which point at duplicate data just point at the same array index, de-duplicating your data.
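Here’s a hypothetical sketch of what that pre-processing could produce, where a code generator has already replaced string keys with array indices, and two keys that had the same value share one slot:

```cpp
// Hypothetical generated code: string keys converted to array indices ahead of
// time. Duplicate values are de-duplicated by pointing two keys at one slot.
enum DataIndex
{
    DataIndex_PlayerHealth = 0, // was key "player_health"
    DataIndex_EnemyHealth  = 0, // was key "enemy_health" - same value, shared slot
    DataIndex_PlayerSpeed  = 1, // was key "player_speed"
    DataIndex_Count        = 2
};

static const int g_data[DataIndex_Count] = { 100, 7 };

// At runtime, a lookup is just an array read at a known index.
inline int Lookup(DataIndex index) { return g_data[index]; }
```

All the string handling and hashing happened before the program ran; the runtime cost is a single indexed read.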

If doing that is too complex, or too much work, a low tech and low effort way to handle the problem seems to be to break your data up into buckets, possibly based on their first letter, and then doing brute force (or something else) to do the lookup among the fewer number of items.

In fact, that second method is sort of like a hard coded trie which is only two levels deep.

If you needed to do some hashing at runtime, finding a faster hash function (that also worked in constexpr, or at least had a constexpr version!) could help you get closer to the pre-hashed keys timings. The good news is the hash doesn’t have to be particularly good. It just has to be fast and have no collisions for the input values you wish to use. That seems like something where brute force searching simple hash functions with various salt values may give you the ideal solution, but probably would take a very, very long time to find what you are looking for. You might notice that the default hash used for std::unordered_map is actually faster than the crc32 implementation I used.
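Here’s a sketch of what that brute force salt search could look like, using a made-up multiplicative hash and the 337 bucket count from earlier; the search just tries salts until every word lands in a distinct bucket:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical cheap hash with a salt parameter; nothing about its quality
// matters except speed and, for the chosen salt, zero collisions on our words.
uint32_t CheapHash(const char* s, uint32_t salt)
{
    uint32_t h = salt;
    while (*s)
        h = h * 31 + uint32_t(*s++);
    return h % 337; // bucket count from the post
}

// Brute force: try salts until every word hashes to a distinct bucket.
uint32_t FindCollisionFreeSalt(const char* const* words, size_t count)
{
    for (uint32_t salt = 0; ; ++salt)
    {
        std::set<uint32_t> seen;
        bool ok = true;
        for (size_t i = 0; i < count && ok; ++i)
            ok = seen.insert(CheapHash(words[i], salt)).second;
        if (ok)
            return salt;
    }
}
```

For a handful of words this finds a salt almost instantly; the "very, very long time" kicks in when you also want the hash minimal (dense buckets) or the word list is large.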

Of course, we also learned what NOT to do. Don’t use brute force, and don’t use std::map. Using std::unordered_map isn’t super awful compared to those solutions, but you can do a lot better if you want to.

## Why This?

This fast key to value lookup might sound like a contrived need to some people, but in game development (I am a game developer!), there is commonly the concept of a game database, where you look up data about things (how much damage does this unit do?) by looking up a structure based on a unique ID that is a string, named by a human. So, in game dev, which also has high performance needs, optimizing this usage case can be very helpful. There is a little bit more talk about game data needs here: Game Development Needs Data Pipeline Middleware.

## Is Code Faster Than Data?

I still think ad hoc code for data structures can often be faster than generic data structures, and the experiments on this post have positive indications of that.

Another way I think ad hoc code could be helpful is when you have hierarchical and/or heterogeneous data structures. By that I mean data structures which have multiple levels, where each level may have different needs for how best to look up data in it, and in fact, even siblings on the same level may have different needs.

In these cases, you could make some data types which had virtual functions to handle the nature of the data needing different solutions at each level, but those virtual function calls and abstractions add up.

I think it’d be superior to have hard coded code that says “oh, you want index 4 of the root array? ok, that means you are going to binary search this list next”. Of course, that code needs to be generated by another program to be effective. If a human has to make sure all that code stays up to date, it’ll be a ton of work, and it’ll be broken, making very subtle hard to reproduce bugs.
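As a hypothetical sketch, generated code for one node of such a structure might look like this, with the lookup strategy (here, a binary search) baked in rather than chosen through virtual dispatch:

```cpp
#include <algorithm>

// Hypothetical generated code for "index 4 of the root array": the generator
// knew this child is a sorted list, so it emitted a direct binary search.
static const int g_rootIndex4SortedKeys[] = { 3, 8, 15, 42, 99 };
static const int g_rootIndex4Values[]     = { 10, 20, 30, 40, 50 };

int LookupRootIndex4(int key)
{
    const int* begin = g_rootIndex4SortedKeys;
    const int* end   = begin + 5;
    const int* it    = std::lower_bound(begin, end, key);
    if (it != end && *it == key)
        return g_rootIndex4Values[it - begin];
    return -1; // key not present in this node
}
```

A sibling node might get a generated switch or a plain array read instead, each chosen to fit its own data, with no runtime abstraction deciding between them.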

A downside I can see to ad hoc code solutions is possibly thrashing the instruction cache more. Not sure if that’d be an issue in practice, it’d be interesting to try more complex data structures and see how it goes.

Also, it might be difficult to have good heuristics to figure out what is best in which situations. I could see a utility possibly generating different variations of code and running them to see which was most performant. Seems like it’d be a challenge to get 100% right all the time, but our experiments make it seem like it’d be easy to do significantly better than generic algorithms which are meant to be dynamic at runtime.

I also think that more complex data structures are more likely to get benefit of having custom code made for them. Simple ones less likely so. It’s hard to beat an array lookup. That being said, the unbeatable simple data structures make great building blocks for the more complex ones (;

It probably would also be good to look into memory usage a bit more to see how ad hoc code compares to generic algorithms. If ad hoc code is much faster but uses more memory, that’d have to be a conscious decision to make when weighing the trade offs.

Maybe in the future, the C++ standards will allow for static data structure types that you have to initialize with compile time constants (allowing constexpr), that are optimized for lookup times since they can’t have any inserts or deletes? I wonder how much demand there would be for that?

Here’s a good read on some other string friendly data structures:
Data Structures for Strings

Twitter user @ores brought up two interesting points:

1. It would be interesting to see how gperf performs in this situation. If it makes a faster minimal perfect hash function, it’ll get us closer to the pre-hashed keys timings.
2. It would be interesting to time scripting languages to see if for them code is faster than data or not. Another interesting aspect of this would be to look at a JIT compiled scripting language like lua-jit. The thing that makes JIT interesting is that it can compile for your specific CPU, instead of having to compile for a more generic set of CPU features. That gives it the opportunity to make code which will perform better on your specific machine.