The Secret to Writing Fast Code / How Fast Code Gets Slow

This is a “soft tech” post. If that isn’t your thing, don’t worry, I’ll be returning to some cool “hard tech” and interesting algorithms after this. I’ve been abusing the heck out of the GPU texture sampler lately, so be on the lookout for some posts on that soon (;

I’m about to show you some of the fastest code there is. It’s faster than the fastest real time raytracer, it’s faster than Duff’s Device.

Heck, despite the fact that it runs on a classical computer, it runs faster than Shor’s Algorithm which uses quantum computing to factor integers so quickly that it breaks modern cryptographic algorithms.

This code also runs faster than Grover’s Algorithm which is another quantum algorithm that can search an unsorted list in O(sqrt(N)).

Even when compiled in debug it runs faster than all of those things.

Are you ready? here it is…

// Some of the fastest code the world has ever seen
int main (int argc, char **argc)
{
    return 0;
}

Yes, the code does nothing and that is precisely why it runs so fast.

The Secret to Writing Fast Code

The secret to writing fast code, no matter what you are writing is simple: Don’t do anything that is too slow.

Follow me on a made up example to see what I’m talking about.

Let’s say you started with a main() function like i showed above and you decided you want to make a real time raytracer that runs on the CPU.

First thing you do is figure out what frame rate you want it to run at, at the desired resolution. From there, you know how many milliseconds you have to render each frame, and now you have a defined budget you need to stay inside of. If you stay in that budget, you’ll consider it a real time raytracer. If you go outside of that budget, it will no longer be real time, and will be a failed program.

You may get camera control working and primary rays intersecting a plane, and find you’ve used 10% of your budget and 90% of the budget remains. So far so good.

Next up you add some spheres and boxes, diffuse and specular shade them with a directional light and a couple point lights. You find that you’ve used 40% of your budget, and 60% remains. We are still looking good.

Next you decide you want to add reflection and refraction, allowing up to 3 ray bounces. You find you are at 80% of your budget and are still looking good. We are still running fast enough to be considered real time.

Now you say to yourself “You know what? I’m going to do 4x super sampling for anti aliasing!”, so you shoot 4 rays out per pixel instead of 1 and average them.

You profile and uh oh! You are at 320% of your budget! Your ray tracer is no longer real time!

What do you do now? Well, hopefully it’s obvious: DON’T DO THAT, IT’S TOO SLOW!

So you revert it and maybe drop in some FXAA as a post processing pass on your render each frame. Now you are at 95% of your budget let’s say.

Now you may want to add another feature, but with only 5% of your budget left you probably don’t have much performance to spare to do it.

So, you implement whatever it is, find that you are at 105% of your budget.

Unlike the 4x super sampling, which was 220% overbudget, this new feature being only 5% over budget isn’t THAT much. At this point you could profile something that already exists (maybe even your new feature) and see if you can improve it’s performance, or if you can find some clever solution that gives you a performance boost, at the cost of things you don’t care about, you can do that to get some performance back. This is a big part of the job as a successful programmer / software engineer – make trade offs where you gain benefits you care about, at the cost of things you do not care about.

At this point, you can also decide if this new feature is more desired than any of the existing features. If it is, and you can cut an old feature you don’t care about anymore, go for it and make the trade.

Rinse and repeat this process with new features and functionality until you have the features you want, that fit within the performance budget you have set.

Follow this recipe and you too will have your very own real time raytracer (BTW related:Making a Ray Traced Snake Game in Shadertoy).

Maintaining a performance budget isn’t magic. It’s basically subtractive synthesis. Carve time away from your performance budget by adding a feature, then optimize or remove features if you are over budget. Rinse and repeat until the sun burns out.

Ok, so if it’s so easy, why do we EVER have performance problems?

How Fast Code Gets Slow

Performance problems come up when we are not paying attention. Sometimes we cause them for ourselves, and sometimes things outside of our control cause them.

The biggest way we cause performance problems for ourselves is by NOT MEASURING.

If you don’t know how your changes affect performance, and performance is something you care about, you are going to have a bad time.

If you care about performance, measure performance regularly! Profile before and after your changes and compare the differences. Have automated tests that profile your software and report the results. Understand how your code behaves in the best and worst case. Watch out for algorithms that sometimes take a lot longer than their average case. Stable algorithms make for stable experiences (and stable frame rates in games). This is because algorithms that have “perf spikes” sometimes line up on the same frame, and you’ll have more erratic frame rate, which makes your game seem much worse than having a stable but lower frame rate.

But, again, performance problems aren’t always the programmers fault. Sometimes things outside of our control change and cause us perf problems.

Like what you might ask?

Well, let’s say that you are tasked with writing some very light database software which keeps track of all employee’s birthdays.

Maybe you use a hash map to store birthdays. The key is the string of the person’s name, and the value is a unix epoch timestamp.

Simple and to the point. Not over-engineered.

Everything runs quickly, your decisions about the engineering choices you made were appropriate and your software runs great.

Now, someone else has a great idea – we have this database software you wrote, what if we use it to keep track of all of our customers and end user birthdays as well?

So, while you are out on vacation, they make this happen. You come back and the “database” software you made is running super slow. There are hundreds of thousands of people stored in the database, and it takes several seconds to look up a single birthday. OUCH!

So hotshot, looks like your code isn’t so fast huh? Actually no, it’s just that your code was used for something other than the original intended usage case. If this was included in the original specs, you would have done something different (and more complex) to handle this need.

This was an exaggerated example, but this sort of thing happens ALL THE TIME.

If you are working on a piece of software, and the software requirements change, it could turn any of your previous good decisions into poor decisions in light of the new realities.

However, you likely don’t have time to go back and re-think and possibly re-work every single thing you had written up to that point. You move onward and upward, a little more heavy hearted.

The target moved, causing your code to rot a bit, and now things are likely in a less than ideal situation. You wouldn’t have planned for the code you have with the info you have now, but it’s the code you do have, and the code you have to stick with for the time being.

Every time that happens, you incur a little more tech debt / code complexity and likely performance problems as well.

You’ll find that things run a little slower than they should, and that you spend more time fighting symptoms with small changes and somewhat arbitrary rules – like telling people not to use name lengths more than 32 characters for maximum performance of your birthday database.

Unfortunately change is part of life, and very much part of software development, and it’s impossible for anyone to fully predict what sort of changes might be coming.

Those changes are often due to business decisions (feedback on product, jockying for a new position in the marketplace, etc), so are ultimately what give us our paychecks and are ultimately good things. Take it from me, who has worked at ~7 companies in 15 years. Companies that don’t change/adapt die.

So, change sucks for our code, but it’s good for our wallets and keeps us employed 😛

Eventually the less than ideal choices of the past affecting the present will reach some threshold where something will have to be done about it. This will likely happen at the point that it’s easier to refactor some code, than to keep fighting the problems it’s creating by being less than ideal, or when something that really NEEDS to happen CAN’T happen without more effort than the refactor would take.

When that happens, the refactor comes in, where you DO get to go back and rethink your decisions, with knowledge of the current realities.

The great thing about the refactor is that you probably have a lot of stuff that your code is doing which it doesn’t really even NEED to be doing.

Culling that dead functionality feels great, and it’s awesome watching your code become simple again. It’s also nice not having to explain why that section of code behaves the way it does (poorly) and the history of it coming to be. “No really, I do know better, but…!!!”

One of the best feelings as a programmer is looking at a complex chunk of code that has been a total pain, pressing the delete key, and getting a little bit closer back to the fastest code in the world:

// Some of the fastest code the world has ever seen
int main (int argc, char **argc)
{
    return 0;
}

PS: Another quality of a successful engineer is being able to constantly improve software as it’s touched. If you are working in an area of code, and you see something ugly that can be fixed quickly and easily, do it while you are there. Since the only constant in software development is change, and change causes code quality to continually degrade, make yourself a force of continual code improvement and help reverse the flow of the code flowing into the trash can.

Engines

In closing, I want to talk about game engines – 3rd party game engines, and re-using an engine from a different project. This also applies to using middleware.

Existing engines are great in that when you and your team know how to use them, you can get things set up very quickly. It lets you hit the ground running.

However, no engine is completely generic. No engine is completely flexible.

That means that when you use an existing engine, there will be some amount of features and functionality which were made without your specific usage case in mind.

You will be stuck in the world where from day 1 you are incurring the tech debt type problems I describe above, but you will likely be unable or unwilling to refactor everything to suit your needs specifically.

I don’t mention this to say that engines are bad. Lots of successful games have used engines made by other people, or re-used engines from previous projects.

However, it’s a different kind of beast using an existing engine.

Instead of making things that suit your needs, and then using them, you’ll be spending your time figuring out how to use the existing puzzle pieces to do what you want. You’ll also be spending time backtracking as you hit dead ends, or where your first cobbled together solution didn’t hold up to the new realities, and you need to find a new path to success that is more robust.

Just something to be aware of when you are looking at licensing or re-using an engine, and thinking that it’ll solve all your problems and be wonderful. Like all things, it comes at a cost!

Using an existing engine does put you ahead of the curve: At day 1 you already have several months of backlogged technical debt!

Unfortunately business realities mean we can’t all just always write brand new engines all the time. It’s unsustainable :/

Agree / Disagree / Have something to say?

Leave a comment below, or tweet at me on twitter: @Atrix256

Minimizing Code Complexity by Programming Declaratively

Writing good code is something all programmers aspire to, but the definition of what actually makes good code can be a bit tricky to pin down. The idea of good code varies from person to person, from language to language, and also varies between problem domains. Web services, embedded devices and game programming are few software domains that all have very different needs and so also have very different software development styles, methods and best practices.

I truly believe that we are in the stone ages of software development (ok, maybe the bronze age?), and that 100 years from now, people will be doing things radically differently than we do today because they (or we) will have figured out better best practices, and the languages of the day will usher people towards increased success with decreased effort.

This post is on something called declarative programming. The idea is nothing new, as prolog from 1972 is a declarative language, but the idea of declarative programming is something I don’t think is talked about enough in the context of code quality.

By the end of this read, I hope you will agree that programming declaratively by default is a good best practice that pertains to all languages and problem domains. If not, leave a comment and let me know why!

Declarative vs Imperative Programming

Declarative programming is when you write code that says what to do. Imperative programming is when you write code that says how to do it.

Below is some C++ code written imperatively. How long does it take you to figure out what the code is doing?

	int values[4] = { 8, 23, 2, 4 };
	int sum = 0;
	for (int i = 0; i < 4; ++i)
		sum += values[i];
	int temp = values[0];
	for (int i = 0; i < 3; ++i)
		values[i] = values[i + 1];
	values[3] = temp;

Hopefully it didn’t take you very long to understand the code, but you had to read it line by line and reason about what each piece was doing. It may not be difficult, but it wasn’t trivial.

Here is the same code with some comments, which helps it be understandable more quickly, assuming the comments haven’t become out of date (:

	// Declare array
	int values[4] = { 8, 23, 2, 4 };

	// Calculate sum
	int sum = 0;
	for (int i = 0; i < 4; ++i)
		sum += values[i];

	// Rotate array items one slot left.
	int temp = values[0];
	for (int i = 0; i < 3; ++i)
		values[i] = values[i + 1];
	values[3] = temp;

Here is some declarative code that does the same thing:

	int values[4] = { 8, 23, 2, 4 };
	int sum = SumArray(values);
	RotateArrayIndices(values, -1);

The code is a lot quicker and easier to understand. In fact the comments aren’t even needed anymore because the code is basically what the comments were.

Comments are often declarative, saying what to do right next to the imperative code that says how to do it. If your code is also declarative though, there is no need for the declarative comments because they are redundant! In fact, if you decide to start trying to write code more declaratively, one way to do so is if you ever find yourself writing a declarative comment to explain what some code is doing, wrap it in a new function, or see if there is an existing function you ought to be using instead.

As a quick tangent, you can use the newer C++ features to make code more declarative, like the below. You arguably should be doing that when possible, if your code base uses STL, a custom STL implementation, or an in house STL type replacement, but I want to stress that this is a completely separate topic than whether or not we should be using new C++ features. Some folks not used to STL will find the below hard to read compared to the last example, which takes away from the main point. So, if you aren’t a fan of STL due to it’s readability (I agree!), or it’s performance characteristics (I also agree!), don’t worry about it. For people on the other side of the fence, you can take this as a pro STL argument though, as it does make code more declarative, if the readability and perf things aren’t impacting you.

	std::array<int,4> values = { 8, 23, 2, 4 };
	int sum = std::accumulate(values.begin(), values.end(), 0);
	std::rotate(values.begin(), values.begin() + 1, values.end());

We Already Live in a Semi-Declarative World

When reading the tip about using (declarative) comments as a hint for when to break some functionality out into another function, you may be thinking to yourself: “Wait, isn’t that just the rule about keeping functions small, like to a few lines per function?”

Yeah, that is true. There is overlap between that rule and writing declarative code. IMO declarative code is a more general version of that rule. That rule is part of making code declarative, and gives some of the benefits, but isn’t the whole story.

The concept of D.R.Y. “Don’t Repeat Yourself” also ends up causing your code to become more declarative. When you are repeating yourself, it’s often because you are either duplicating code, or because there is boiler plate code that must be added in multiple places to make something work. By applying D.R.Y. and making a single authoritative source of your information or work, you end up taking imperative details out of the code, thus making what remains more declarative. For more information on that check out this post of mine: Macro Lists For The Win

TDD

If your particular engineering culture uses TDD (test driven development), you may also say “Hey, this isn’t actually anything special, this is also what you get when you use TDD.”

Yes, that is also true. Test driven development forces you to write code such that each individual unit of work is broken up into it’s own contained, commonly stateless, function or object.

It’s suggested that the biggest value of TDD comes not from the actual testing, but from how TDD forces you to organize your code into small logical units of work, that are isolatable from the rest of the code.

In other words, TDD forces you to make smaller functions that do exactly what they say by their name and nothing else. Sound familiar? Yep, that is declarative programming.

Compilers, Optimizers and Interpreters

The whole goal of compilers, optimizers and interpreters is to make it so you the coder can be more declarative and less imperative.

Compilers make it so you don’t have to write assembly (assembly being just about as imperative as you can get!). You can instead write higher level concepts about what you want done – like loop over an array and sum up the values – instead of having to write the assembly (or machine code!) to load values into memory or registers, do work, and write them back out to memory or registers.

Similarly, the whole goal of optimizers are to take code where you describe what you want to happen, and find a way to do the equivalent functionality in a faster way. In other words, you give the WHAT and it figures out the HOW. That is declarative programming.

Interestingly, switch statements are declarative as well. You tell the compiler what items to test for at run time but leave it up to the compiler to figure out how to test for them. It turns out that switch statements can decide at compile time whether they want to use binary searching, if/else if statements, or other tricks to try and make an efficient lookup for the values you’ve provided.

Surprised to hear that? Give this a read: Something You May Not Know About the Switch Statement in C/C++

Similarly, profile guided optimization (PGO) is a way for the optimizer to know how your code actually behaves at runtime, to get a better idea at what machine code it ought to generate to be more optimal. Again, it’s taking your more declarative high level instructions, and creating imperative low level instructions that actually handle the HOW of doing what your code wants to do in a better way.

C#

If you’ve spent any time using C#, I’ll bet you’ve come to the same conclusion I have: If it takes you more than one line of code to do a single logical unit of work (read a file into a string, sort a list, etc), then you are probably doing it wrong, and there is probably some built in functionality already there to do it for you.

When used “correctly”, C# really tends to be declarative.

C++ Advancements Towards Being Declarative

In the old days of C, there were no constructors or destructors. This meant that you had to code carefully and initialize, deinitialize things at just the right time.

These were imperative details that if you got wrong, would cause bugs and make a bad day for you and the users of your software.

C++ improved on this problem by adding constructors and destructors. You could now put these imperative details off in another section and then not worry about it in the bulk of the code. C++ made C code more declarative by letting you focus more on the WHAT to do, and less on HOW to do it, in every line of code.

In more recent years, we’ve seen C++ get a lot of upgrades, many of which make C++ more declarative. In other words, common things are now getting language and/or STL library support.

For one, there are many operations built in which people used to do by hand that are now built in – such as std::sort or std::qsort. You no longer have to write out a sorting algorithm imperatively, you just use std::sort and move on.

Another really good example of C++ getting more declarative is lambdas. Lambdas look fancy and new, but they are really just a syntactic shortcut to doing something we could do all along. When you make a lambda, the compiler makes a class for you that overloads the parentheses operator, has storage for your captures and captures those captures. A struct that looks like this is called a functor and has existed for a long time before lambdas ever entered C++. The only difference is that if you want to use a functor now, you don’t have to go through a bunch of nitty gritty imperative details for making your functor class. Now, you just defined a lambda and move on.

Domain Specific Languages

Domain specific languages – aka DSLs – exist to let people write code meant for specific purposes. Some examples of DSLs are:

  • HTML – what we use to make static webpages
  • XSLT – a language to transform XML data into other data
  • SQL – a language to query information from databases
  • Regex – a language to search text

Believe it or not, DSL is a synonym of declarative programming languages.

HTML for instance completely cuts out things like loops, memory allocation and image loading, and lets you just specify how a web page should look. HTML is declarative because you deal only with the issues in the problem space, not with the imperative details of how to make it all happen.

It’s similar for the others in the list, and other DSLs not on the list. They all try to remove complexity you don’t care about to try and distill a language that deals only with the things in the problem space.

Our Job As Programmers

As programmers, it’s only one part of our job to make “good code” that is easy to read and easy to maintain, but many non programmers would laugh to hear that we spend so much time thinking about that.

The other part of our job is the end result of what our program does. This is what non programmers focus more heavily on of course, and is ultimately what makes software successful or not – at least in the short term. Software needs to do good stuff well to be successful, but if you don’t make good code, you are going to sink your business in bugs, inflexibility, maintenance costs, etc.

Programmers mainly either write code for use by other programmers (such as libraries and APIs), or they make software for use by other people.

In both cases, the job is really that we are trying to hide away imperative details (implementation complexity) and give our customers a domain specific language to do what they want to do in the easiest and best way possible. It’s very important in both cases that the users of your API or the users of your software don’t have to deal with things outside the problem space. They want to work declaratively, saying only what to do, and have our software handle the imperative details of how to do it. They paid for the software so they didn’t have to deal with those details.

As an example, when you work in an excel spreadsheet and it does an average of a row of columns, it doesn’t make you decide whether it should use SIMD instructions to do the math or scalar instructions. You don’t really care, and it almost certainly doesn’t matter enough to anyone using excel which to do, so excel just does whatever it does internally to give you what you asked for.

It can be a challenge knowing what needs to be hidden away when making an API or software for users, but that comes from understanding what it is that your customers actually need and what they are trying to do, which is already a super important step.

The good news is that you don’t have to perfectly match the customers needs to improve their lives. Any imperative details that you can remove is a win. I’m not talking about taking away abilities that people want and should have, I’m talking about removing “chores”, especially ones that if done wrong can cause problems – like nulling out a pointer after deleting it, or initializing all members of a class when creating an object, or the details of loading an image into memory.

None of this should really be that surprising to anyone, but hopefully thinking of these things in a declarative vs imperative way formalizes and generalizes some ideas.

Why Wouldn’t You Program Declaratively?

Purely declarative programming means that you only describe the things you care about and nothing else. If this is true, why wouldn’t you ALWAYS program declaratively? In fact, why do imperative languages even exist? Why would you ever want to waste time dealing with what you by definition did not care about?

Well, for one, it’s impossible to nail down what it is exactly that people do and do not care about, especially in something like a programming language which is likely to be used for lots of different purposes by lots of different people. It’s been real eye opening seeing the diverse needs of the C++ community in recent years for instance. As a C++ game programmer, surrounded by primarily C++ game programmers, I thought I knew what the language needed, but there are lots of things I never considered because I don’t have a need for, unlike some other C++ programmers out there.

Another big point is that declarative languages by definition are a sort of black box. You tell it what to do but not how. It has to figure out the details of how to do it in a good way. The problem is that the compiler (or similar process) has limited abilities to make these choices, and also has limited information about the problem space.

For instance, a declarative language may let you work with a set and say “put item X into the set” and “does item Y exist in this set?”. You can imagine it could perhaps use a hash table, where each hash bucket was a linked list of values. This way, when you queried if the item Y was in the set, it could hash it, then do comparisons against whatever items were in that bucket.

That implementation is fairly reasonable for many programs.

What if instead, you want to keep a set of unique visitors to a popular website, like say google.com? That set is going to use a ton of memory and/or be very slow because it’s going to be HUGE.

In that case, someone is likely to go with a probabilistic algorithm perhaps (maybe a bloom filter), where it’s ok that the answer isn’t exactly right, because the memory usage and computation time drops off significantly going probabilistic, and actually makes the feature possible.

The declarative language is likely not going to be smart enough to figure out that it should use a probabilistic algorithm, and nobody ever told it that it could.

Sure, you could add probabilistic set support to the declarative language, and people could specifically ask for it when they need it (they care about it, so it should be part of the DSL), but we could make this argument about many other things. The point is just that without super smart AI and lots more information (and freedom to make decisions independently of humans), a declarative language is going to be pretty limited in how well it can do in all situations.

Because of this, it’s important for the programmer to be able to profile processing time and other resource usage, and be able to “go imperative” where needed to address any problems that come up.

This is similar to how when writing C++, when we REALLY NEED TO, we can write some inline assembly. The C++ is the more declarative language, that allows you to write more imperative assembly when you need to.

It’s important to note that I’m not saying that declarative programming is inherently slower than imperative programming though. Declarative languages can actually be much faster and more efficient with resources believe it or not. In the example at the beginning of the post where i used std::rotate to replace a loop that moved items in an array, it’s possible that std::rotate uses a memmove to move the bulk of the items, instead of an item by item copy like what I coded. That would be a much better solution, especially for large array sizes.

So, declarative programming isn’t necessarily slower than imperative programming, but, for the times it isn’t doing well enough, we need a way to turn off “auto pilot” mode and give imperative instructions for how to do something better.

In more human terms: If you asked someone to go get the mail, you likely will say “can you get my mail? here’s the key and it’s in box 62.”. You wouldn’t tell the person how to walk to the door, open it, walk out, close it, etc. However, if there were special instructions such as “please check the package locker too”, you would give those details.

Basically, you give only the details that are needed, as simply as possible, but reserve the right to give as fine grained details as needed, when they are needed.

So, i propose this:

  • We as programmers ought to be programming declaratively by default, only resorting to imperative programming when we need to.
  • Our job is to empower our customers to work declaratively by making them good DSLs (aka interfaces), but we should remember that it might be important to let them also go more imperative when needed.

Thanks for reading, and let me know your thoughts!

Links

Here are some interesting links about managing code complexity and writing high quality code:
Youtube: Simplicity Matters (36 minutes)
Functions Should Be Short And Sweet, But Why?
Bitsquid: Managing Coupling
Thoughts on Declarative and Imperative Languages
Declarative vs. Imperative Programming for the Web

Low Tech Homomorphic Encryption

Homomorphic encryption is a special type of encryption that lets you do calculations on encrypted values as if they weren’t encrypted. One reason it’s desired is that secure computing could be done in the cloud, if practical homomorphic encryption were available.

Homomorphic encryption has been a hot research topic since 2009, when Craig Gentry figured out a way to do it while working on his PhD. Since then, people have been working on making it better, faster and more efficient.

You can read more about a basic incarnation of his ideas in my blog posts:
Super Simple Symmetric Leveled Homomorphic Encryption Implementation
Improving the Security of the Super Simple Symmetric Leveled Homomorphic Encryption Implementation

This post is about a low tech type of homomorphic encryption that anyone can easily do and understand. There is also some very simple C++ that implements it.

This idea may very well be known about publically, but I’m not aware of any places that talk about it. I may just be ignorant of them though so ::shrug::

Quick Update

I’ve gotten some feedback on this article, the most often feedback being that this is obfuscation not encryption. I think that’s a fair assessment as the secret value you are trying to protect is in no way transformed, but is just hidden. This post could easily be titled Homomorphic Obfuscation, and perhaps should be.

To see other feedback and responses to this post, check out the reddit links at the bottom!

The Idea

The idea is actually super simple:

  1. Take the value you want to encrypt.
  2. Hide it in a list of a bunch of other random values, and remember where it is in the list. The position in the list is your key.
  3. Send this list to an untrusted party.
  4. They do the same calculation on every item in the list and send it back.
  5. Since you know which value was your secret value, you know which answer is the one you care about.

At the end of that process, you have the resulting value, and they have no idea what value was your secret value. You have done, by definition, homomorphic encryption!

There is a caveat of course… they know that your secret value was ONE of the values on the list.

Security Details

The thing here is that security is a sliding scale between resource usage (computation time, RAM, network bandwidth, etc) and security.

The list size is your security parameter in this case.

A larger list of random values means that it takes longer to transfer the data, more memory to store it, it takes longer to do the homomorphic computations, but the untrusted party is less sure about what your secret value is.

On the other hand, a shorter list is faster to transmit, easier to store, quicker to compute with, but the untrusted party has a better idea what your secret value is.

For maximal security you can just take this to the extreme – if your secret value is a 32 bit floating point number, you could make a list with all possible 2^32 floating point numbers in it, have them do the calculation and send it back. You can even do an optimization here and not even generate or send the list, but rather just have the person doing the calculations generate the full 2^32 float list, do the calculations, and send you the results.

That gets pretty big pretty fast though. That list would actually be 16 gigabytes, but the untrusted party would know almost nothing about your value, other than it can be represented by a 32 bit floating point number.

Depending on your security needs, you might be ok with shortening your list a bit to bring that number down. Making your list only be one million numbers long (999,999 random numbers and one value you actually care about), your list is only 3.8 megabytes.

Not quite as bad.

Some Interesting Abilities

Using this homomorphic encryption, like other homomorphic encryption, you can do computation involving multiple encrypted values. AKA you could multiply two encrypted values together. To do this, you are going to need to encrypt all values involved using the same key. In other words, they are going to have to be at the same index in each of their respective lists of random numbers.

Something else that is interesting is that you can also encode MULTIPLE secret values in your encrypted value list. You could have 1 secret value at index 50 and another at index 100 for instance. Doing this, you get a sort of homomorphic SIMD setup.

Homomorphic SIMD is actually a real thing in other homomorphic encryption methods as well. Check out this paper for instance:
Fully Homomorphic SIMD Operations

The only problem with homomorphic SIMD is that adding more secret values to the same encrypted list decreases the security, since there are more values in the list that you don’t want other people to know about.

You can of course also modify encrypted values by unencrypted values. You could multiply an encrypted value by 3, by multiplying every value in the list by 3.

Extending to Public Key Cryptography

If you wanted to use asymmetric key cryptography (public/private keys) instead of symmetric key cryptography, that is doable here too.

What you would do is have the public key public as per usual, and that key would be used in a public key algorithm to encrypt the index of the secret value in the random list.

Doing this, the person who has the private key would be able to receive the list and encrypted index, decrypt the index, and then get the secret value out using that index.

Sample Code Tests

The sample code only does Symmetric key encryption, and does these 3 tests:

  1. Encrypts two floating point numbers into a single list, SIMD style, does an operation on the encrypted values, then unencrypts and verifies the results.
  2. Does the same with two sets of floats (three floats in each set), to show how you can make encrypted values interact with each other. Does the operation, then unencrypts and verifies the results.
  3. Encrypts three values of a 3 byte structure, does an operation on the encrypted values, then unencrypts and verifies the results.

All secret data was hidden in lists of 10,000,000 random values. That made the first two tests (the ones done with 4 byte floats) have encrypted files of 38.1MB (40,000,000 bytes), and the last test (the one done with a 3 byte struct) had a file size of 28.6 MB (30,000,000 bytes).

Here are the timing of the above tests:

Sample Code

Here is the header, LTHE.h:

/*

Written by Alan Wolfe
http://blog.demofox.org


*/

#pragma once
#include <vector>
#include <random>

// A static class with template functions in it.
// A namespace would be nice, except I want to hide some things as private.
class LTHE
{
public:

    //=================================================================================
    template <typename T>
    static bool Encrypt (std::vector<T> values, size_t listSize, const char* fileName, std::vector<size_t>& keys, bool generateKeys = true)
    {
        // Make sure we have a list that is at least as long as the values we want to encrypt
        if (values.size() > listSize)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): values.size() > listSize.n");
            return false;
        }

        // Generate a list of keys if we are told to
        // Ideally you want to take the first M items of a cryptographically secure shuffle
        // of N items.
        // This could be done with format preserving encryption or some other method
        // to make it not roll and check, and also more secure random.
        if (generateKeys)
        {
            keys.clear();
            for (size_t i = 0, c = values.size(); i < c; ++i)
            {
                size_t newKey;
                do
                {
                    newKey = RandomInt<size_t>(0, listSize - 1);
                }
                while (std::find(keys.begin(), keys.end(), newKey) != keys.end());
                keys.push_back(newKey);
            }
        }

        // make a file of random values, size of T, count of <listSize> 
        FILE *file = fopen(fileName, "w+b");
        if (!file)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.n", fileName);
            return false;
        }

        // Note: this may not be the most efficient way to generate this much random data or 
        // write it all to the file.
        // In a real crypto usage case, you'd want a crypto secure random number generator.
        // You'd also want to make sure the random numbers had the same properties as your
        // input values to help anonymize them better.
        // Like if your numbers are not whole numbers, you don't want to generate only whole numbers.
        // Or if your numbers are salaries, you may not want purely random values, but more "salaryish"
        // looking numbers.
        // You could alternately just do all 2^N possible values which would definitely anonymize
        // the values you wanted to encrypt.  This is maximum security, but also takes most
        // memory and most processing time.
        size_t numUint32s = (listSize * sizeof(T)) / sizeof(uint32_t);
        size_t numExtraBytes = (listSize * sizeof(T)) % sizeof(uint32_t);
        for (size_t i = 0; i < numUint32s; ++i)
        {
            uint32_t value = RandomInt<uint32_t>();
            if (fwrite(&value, sizeof(value), 1, file) != 1)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write random numbers (uint32s).n");
                fclose(file);
                return false;
            }
        }
        for (size_t i = 0; i < numExtraBytes; ++i)
        {
            uint8_t value = RandomInt<uint8_t>();
            if (fwrite(&value, sizeof(value), 1, file) != 1)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write random numbers (extra bytes).n");
                fclose(file);
                return false;
            }
        }

        // Now put the values in the file where they go, based on their key
        for (size_t i = 0, c = values.size(); i < c; ++i)
        {
            long pos = (long)(keys[i] * sizeof(T));
            if (fseek(file, pos, SEEK_SET) != 0)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not fseek.n");
                fclose(file);
                return false;
            }
            if (fwrite(&values[i], sizeof(values[i]), 1, file) != 1)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write secret value.n");
                fclose(file);
                return false;
            }
        }

        // close file and return success
        fclose(file);
        return true;
    }

    //=================================================================================
    template <typename T, typename LAMBDA>
    static bool TransformHomomorphically (const char* srcFileName, const char* destFileName, const LAMBDA& function)
    {
        // open the source and dest file if we can
        FILE *srcFile = fopen(srcFileName, "rb");
        if (!srcFile)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.n", srcFileName);
            return false;
        }
        FILE *destFile = fopen(destFileName, "w+b");
        if (!destFile)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.n", destFileName);
            fclose(srcFile);
            return false;
        }

        // Process the data in the file and write it back out.
        // This could be done much better.
        // We could read more from the file at once.
        // We could use SIMD.
        // We could go multithreaded.
        // We could do this on the GPU for large data sets and longer transformations! Assuming data transfer time isn't too prohibitive.
        // We could decouple the disk access from processing, so it was reading and writing while it was processing.
        const size_t c_bufferSize = 1024;
        std::vector<T> dataBuffer;
        dataBuffer.resize(c_bufferSize);
        size_t elementsRead;
        do
        {
            // read data from the source file
            elementsRead = fread(&dataBuffer[0], sizeof(T), c_bufferSize, srcFile);

            // transform the data
            for (size_t i = 0; i < elementsRead; ++i)
                dataBuffer[i] = function(dataBuffer[i]);

            // write the transformed data to the dest file
            if (fwrite(&dataBuffer[0], sizeof(T), elementsRead, destFile) != elementsRead)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write transformed elements.n");
                fclose(srcFile);
                fclose(destFile);
                return false;
            }
        }
        while (!feof(srcFile));

        // close files and return success
        fclose(srcFile);
        fclose(destFile);
        return true;
    }

    //=================================================================================
    template <typename T, typename LAMBDA>
    static bool TransformHomomorphically (const char* src1FileName, const char* src2FileName, const char* destFileName, const LAMBDA& function)
    {
        // open the source and dest file if we can
        FILE *srcFile1 = fopen(src1FileName, "rb");
        if (!srcFile1)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.n", src1FileName);
            return false;
        }
        FILE *srcFile2 = fopen(src2FileName, "rb");
        if (!srcFile2)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.n", src2FileName);
            fclose(srcFile1);
            return false;
        }
        FILE *destFile = fopen(destFileName, "w+b");
        if (!destFile)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.n", destFileName);
            fclose(srcFile1);
            fclose(srcFile2);
            return false;
        }

        // Process the data in the file and write it back out.
        // This could be done much better.
        // We could read more from the file at once.
        // We could use SIMD.
        // We could go multithreaded.
        // We could do this on the GPU for large data sets and longer transformations! Assuming data transfer time isn't too prohibitive.
        // We could decouple the disk access from processing, so it was reading and writing while it was processing.
        const size_t c_bufferSize = 1024;
        std::vector<T> dataBuffer1, dataBuffer2;
        dataBuffer1.resize(c_bufferSize);
        dataBuffer2.resize(c_bufferSize);
        size_t elementsRead1;
        size_t elementsRead2;
        do
        {
            // read data from the source files
            elementsRead1 = fread(&dataBuffer1[0], sizeof(T), c_bufferSize, srcFile1);
            elementsRead2 = fread(&dataBuffer2[0], sizeof(T), c_bufferSize, srcFile2);

            if (elementsRead1 != elementsRead2)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Different numbers of elements in each file!n");
                fclose(srcFile1);
                fclose(srcFile2);
                fclose(destFile);
                return false;
            }

            // transform the data
            for (size_t i = 0; i < elementsRead1; ++i)
                dataBuffer1[i] = function(dataBuffer1[i], dataBuffer2[i]);

            // write the transformed data to the dest file
            if (fwrite(&dataBuffer1[0], sizeof(T), elementsRead1, destFile) != elementsRead1)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write transformed elements.n");
                fclose(srcFile1);
                fclose(srcFile2);
                fclose(destFile);
                return false;
            }
        }
        while (!feof(srcFile1));

        // close files and return success
        fclose(srcFile1);
        fclose(srcFile2);
        fclose(destFile);
        return true;
    }

    //=================================================================================
    template <typename T>
    static bool Decrypt (const char* fileName, std::vector<T>& values, std::vector<size_t>& keys)
    {
        // Open the file if we can
        FILE *file = fopen(fileName, "rb");
        if (!file)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.n", fileName);
            return false;
        }

        // Read the values from the file.  The key is their location in the file.
        values.clear();
        for (size_t i = 0, c = keys.size(); i < c; ++i)
        {
            long pos = (long)(keys[i] * sizeof(T));
            if (fseek(file, pos, SEEK_SET) != 0)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not fseek.n");
                fclose(file);
                return false;
            }
            T value;
            if (!fread(&value, sizeof(T), 1, file))
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not decrypt value for key.n");
                fclose(file);
                return false;
            }
            values.push_back(value);
        }

        // Close file and return success
        fclose(file);
        return true;
    }

private:
    template <typename T>
    static T RandomInt (T min = std::numeric_limits<T>::min(), T max = std::numeric_limits<T>::max())
    {
        static std::random_device rd;
        static std::mt19937 mt(rd());
        static std::uniform_int<T> dist(min, max);
        return dist(mt);
    }
};

And here is the test program, main.cpp:

#include <stdio.h>
#include "LTHE.h"
#include <chrono>

//=================================================================================
// times a block of code
struct SBlockTimer
{
    SBlockTimer()
    {
        m_start = std::chrono::high_resolution_clock::now();
    }

    ~SBlockTimer()
    {
        std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
        printf("    %0.2f secondsn", seconds.count());
    }

    std::chrono::high_resolution_clock::time_point m_start;
};

//=================================================================================
float TransformDataUnitary (float& value)
{
    return (float)sqrt(value * 2.17f + 0.132);
}

//=================================================================================
float TransformDataBinary (float& value1, float value2)
{
    return (float)sqrt(value1 * value1 + value2 * value2);
}

//=================================================================================
struct SStruct
{
    uint8_t x, y, z;

    static SStruct Transform (const SStruct& b)
    {
        SStruct ret;
        ret.x = b.x * 2;
        ret.y = b.y * 3;
        ret.z = b.z * 4;
        return ret;
    }

    bool operator != (const SStruct& b) const
    {
        return b.x != x || b.y != y || b.z != z;
    }
};

//=================================================================================
int Test_FloatUnitaryOperation ()
{
    printf("n----- " __FUNCTION__ " -----n");

    // Encrypt the data
    printf("Encrypting data:  ");
    std::vector<float> secretValues = { 3.14159265359f, 435.0f };
    std::vector<size_t> keys;
    {
        SBlockTimer timer;
        if (!LTHE::Encrypt(secretValues, 10000000, "Encrypted.dat", keys))
        {
            fprintf(stderr, "Could not encrypt data.n");
            return -1;
        }
    }

    // Transform the data
    printf("Transforming data:");
    {
        SBlockTimer timer;
        if (!LTHE::TransformHomomorphically<float>("Encrypted.dat", "Transformed.dat", TransformDataUnitary))
        {
            fprintf(stderr, "Could not transform encrypt data.n");
            return -2;
        }
    }

    // Decrypt the data
    printf("Decrypting data:  ");
    std::vector<float> decryptedValues;
    {
        SBlockTimer timer;
        if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
        {
            fprintf(stderr, "Could not decrypt data.n");
            return -3;
        }
    }

    // Verify the data
    printf("Verifying data:   ");
    {
        SBlockTimer timer;
        for (size_t i = 0, c = secretValues.size(); i < c; ++i)
        {
            if (TransformDataUnitary(secretValues[i]) != decryptedValues[i])
            {
                fprintf(stderr, "decrypted value mismatch!n");
                return -4;
            }
        }
    }

    return 0;
}

//=================================================================================
int Test_FloatBinaryOperation ()
{
    printf("n----- " __FUNCTION__ " -----n");

    // Encrypt the data
    printf("Encrypting data:  ");
    std::vector<float> secretValues1 = { 3.14159265359f, 435.0f, 1.0f };
    std::vector<float> secretValues2 = { 1.0f, 5.0f, 9.0f };
    std::vector<size_t> keys;
    {
        SBlockTimer timer;
        if (!LTHE::Encrypt(secretValues1, 10000000, "Encrypted1.dat", keys))
        {
            fprintf(stderr, "Could not encrypt data.n");
            return -1;
        }
        if (!LTHE::Encrypt(secretValues2, 10000000, "Encrypted2.dat", keys, false)) // reuse the keys made for secretValues1
        {
            fprintf(stderr, "Could not encrypt data.n");
            return -1;
        }
    }

    // Transform the data
    printf("Transforming data:");
    {
        SBlockTimer timer;
        if (!LTHE::TransformHomomorphically<float>("Encrypted1.dat", "Encrypted2.dat", "Transformed.dat", TransformDataBinary))
        {
            fprintf(stderr, "Could not transform encrypt data.n");
            return -2;
        }
    }

    // Decrypt the data
    printf("Decrypting data:  ");
    std::vector<float> decryptedValues;
    {
        SBlockTimer timer;
        if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
        {
            fprintf(stderr, "Could not decrypt data.n");
            return -3;
        }
    }

    // Verify the data
    printf("Verifying data:   ");
    {
        SBlockTimer timer;
        for (size_t i = 0, c = secretValues1.size(); i < c; ++i)
        {
            if (TransformDataBinary(secretValues1[i], secretValues2[i]) != decryptedValues[i])
            {
                fprintf(stderr, "decrypted value mismatch!n");
                return -4;
            }
        }
    }

    return 0;
}

//=================================================================================
int Test_StructUnitaryOperation ()
{
    printf("n----- " __FUNCTION__ " -----n");

    // Encrypt the data
    printf("Encrypting data:  ");
    std::vector<SStruct> secretValues = { {0,1,2},{ 3,4,5 },{ 6,7,8 } };
    std::vector<size_t> keys;
    {
        SBlockTimer timer;
        if (!LTHE::Encrypt(secretValues, 10000000, "Encrypted.dat", keys))
        {
            fprintf(stderr, "Could not encrypt data.n");
            return -1;
        }
    }

    // Transform the data
    printf("Transforming data:");
    {
        SBlockTimer timer;
        if (!LTHE::TransformHomomorphically<SStruct>("Encrypted.dat", "Transformed.dat", SStruct::Transform))
        {
            fprintf(stderr, "Could not transform encrypt data.n");
            return -2;
        }
    }

    // Decrypt the data
    printf("Decrypting data:  ");
    std::vector<SStruct> decryptedValues;
    {
        SBlockTimer timer;
        if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
        {
            fprintf(stderr, "Could not decrypt data.n");
            return -3;
        }
    }

    // Verify the data
    printf("Verifying data:   ");
    {
        SBlockTimer timer;
        for (size_t i = 0, c = secretValues.size(); i < c; ++i)
        {
            if (SStruct::Transform(secretValues[i]) != decryptedValues[i])
            {
                fprintf(stderr, "decrypted value mismatch!n");
                return -4;
            }
        }
    }

    return 0;
}

//=================================================================================
int main (int argc, char **argv)
{
    // test doing an operation on a single encrypted float
    int ret = Test_FloatUnitaryOperation();
    if (ret != 0)
    {
        system("pause");
        return ret;
    }

    // test doing an operation on two encrypted floats
    ret = Test_FloatBinaryOperation();
    if (ret != 0)
    {
        system("pause");
        return ret;
    }

    // test doing an operation on a single 3 byte struct
    ret = Test_StructUnitaryOperation();
    if (ret != 0)
    {
        system("pause");
        return ret;
    }
    
    printf("nAll Tests Passed!nn");
    system("pause");
    return 0;
}

If you found this post interesting or useful, or you have anything to add or talk about, let me know!

Reddit discussion:
r/programming
r/cryptography

Is Code Faster Than Data? Examining Hash Tables

This series of posts is aimed at examining if and how ad hoc code crafted for a specific static (unchanging / constant) data set can run faster than typical generic run time data structures. I think the answer is an obvious “yes, we can often do better”, but these posts explore the details of the problem space and explore how and when we might do better.

The last post explored switch statement performance compared to array access performance.

A switch statement is just a way of telling the compiler how we want to map integral inputs to either some code to run, or some value to return. It’s up to the compiler how to make that happen.

Because compiler makers presumably want to make their compiler generate fast code, it seems like a switch statement should be able to match the speed of an array since a switch statement could very well be implemented as an array when all it is doing is returning different values based on input. Maybe it could even beat an array, in the case of a sparse array, an array with many duplicates, or other situations.

In practice, this doesn’t seem to be the case though, and switch statements are actually quite a bit slower than arrays from the experimentation I’ve done. The main part of the overhead seems to be that it always does a jump (goto) based on the input you are switching on. It can do some intelligent things to find the right location to jump to, but if all you are doing is returning a value, it doesn’t seem smart enough to do a memory read from an array and return that value, instead of doing a jump.

You can read a nice analysis on how switch statements are compiled on the microsoft compiler here: Something You May Not Know About the Switch Statement in C/C++.

Today we are going to be analyzing how hash tables fare against switch statements, arrays, and a few other things.

Testing Details

I ran these tests in x86/x64 debug/release in visual studio 2015.

I got a list of 100 random words from http://www.randomwordgenerator.com/ and made sure they were all lowercase. I associated an integer value with them, from 1 to 100. My tests are all based on the string being the key and the integer being the value.

I have that data stored/represented in several ways for performing lookups:

  1. std::map.
  2. std::unordered_map.
  3. std::unordered_map using crc32 hash function.
  4. std::unordered_map using crc32 hash function modulo 337 and salted with 1147287 to prevent collisions.
  5. SwitchValue() switches on crc32 of input string.
  6. SwitchValueValidate() switches on crc32 of input string but does a single strcmp to handle possibly invalid input.
  7. SwitchValueMinimized() switches on crc32 of input string modulo 337 and salted with 1147287 to prevent collisions.
  8. SwitchValueMinimizedValidate() like SwitchValueMinimized() but does a single strcmp to handle possibly invalid input.
  9. g_SwitchValueMinimizedArray, the array version of SwitchValueMinimized().
  10. g_SwitchValueMinimizedArrayValidate, the array version of SwitchValueMinimizedValidate().
  11. BruteForceByStartingLetter() switches on first letter, then brute force strcmp’s words beginning with that letter.
  12. BruteForce() brute force strcmp’s all words.

The non validating switch statement functions have an __assume(0) in their default case to remove the overhead of testing for invalid values. This is to make them as fast as possible for the cases when you will only be passing valid values in. If ever that contract was broken, you’d hit undefined behavior, so the performance boost comes at a cost. The Validate versions of the switch functions don’t do this, as they are meant to take possibly invalid input in, and handle it gracefully. Both validating and not validating input are common use cases so I wanted to represent both in the performance analysis.

Here are the tests done:

  1. In Order – looks up all strings in order and sums the associated values.
  2. Shuffled – looks up all strings in random order and sums the associated values.
  3. Pre-Hashed Keys In Order – looks up all strings in order and sums the associated values, using pre-hashed keys.
  4. Pre-Hashed Keys Shuffled – looks up all strings in random order and sums the associated values, using pre-hashed keys.

The second two tests only apply to the value lookups which can take pre-hashed keys. For instance, g_SwitchValueMinimizedArray can be indexed by a key that was hashed before the program ran, but a std::unordered_map cannot be indexed by a hash value that was calculated in advance.

Each of those tests were done 5,000 times in a row to make performance differences stand out more, and that full amount of time is the time reported. That process was done 50 times to give both an average (a mean) and a standard deviation to show much much the time samples differed.

The source code for the tests can be found here:
Github: Atrix256/RandomCode/HashVsSwitch

Results

Here are the results, in milliseconds. The values in parentheses are the standard deviations, which are also in milliseconds.

In Order

Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.

Debug Release
Win32 x64 Win32 x64
std::map 7036.77 (126.41) 7070.18 (155.49) 33.02 (2.68) 35.40 (1.43)
std::unordered_map 4235.31 (24.41) 4261.36 (145.16) 19.97 (0.45) 20.12 (0.62)
std::unordered_map crc32 4236.38 (80.72) 4275.36 (116.65) 24.36 (0.47) 23.47 (0.86)
std::unordered_map crc32 minimized 4034.50 (12.72) 4323.67 (170.55) 26.39 (0.50) 23.68 (0.71)
SwitchValue() 123.28 (0.98) 144.29 (4.91) 6.81 (0.30) 5.47 (0.29)
SwitchValueValidate() 127.59 (1.22) 147.41 (5.20) 8.84 (0.35) 7.99 (0.36)
SwitchValueMinimized() 128.83 (0.95) 151.48 (4.66) 8.28 (0.38) 10.18 (0.37)
SwitchValueMinimizedValidate() 132.44 (1.02) 159.85 (6.73) 12.65 (0.40) 10.89 (0.36)
g_SwitchValueMinimizedArray 104.15 (1.13) 122.94 (5.98) 7.68 (0.36) 6.08 (0.36)
g_SwitchValueMinimizedArrayValidate 107.75 (1.07) 120.75 (2.80) 10.49 (0.37) 8.95 (0.32)
BruteForceByStartingLetter() 19.92 (0.63) 22.01 (0.86) 4.85 (0.24) 5.81 (0.26)
BruteForce() 118.65 (1.09) 140.20 (2.28) 31.53 (0.56) 46.47 (0.83)

Shuffled

Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.

Debug Release
Win32 x64 Win32 x64
std::map 7082.92 (214.13) 6999.90 (193.82) 32.14 (0.59) 34.20 (0.62)
std::unordered_map 4155.85 (133.00) 4221.84 (124.70) 20.21 (0.42) 20.09 (0.47)
std::unordered_map crc32 4286.44 (95.39) 4300.81 (64.37) 24.55 (0.57) 23.06 (0.57)
std::unordered_map crc32 minimized 4186.27 (75.35) 4111.73 (43.36) 26.36 (0.56) 23.65 (0.54)
SwitchValue() 127.93 (3.85) 137.63 (1.31) 6.97 (0.32) 5.47 (0.27)
SwitchValueValidate() 131.46 (2.34) 141.38 (1.47) 8.92 (0.38) 7.86 (0.37)
SwitchValueMinimized() 133.03 (2.93) 145.74 (1.50) 9.16 (0.37) 10.50 (0.41)
SwitchValueMinimizedValidate() 135.47 (2.27) 151.58 (1.48) 12.13 (0.40) 10.13 (0.43)
g_SwitchValueMinimizedArray 106.38 (2.70) 118.61 (3.73) 8.18 (0.31) 5.19 (0.29)
g_SwitchValueMinimizedArrayValidate 109.32 (2.34) 120.94 (3.02) 10.49 (0.55) 9.00 (0.40)
BruteForceByStartingLetter() 20.45 (0.92) 21.64 (0.76) 4.90 (0.31) 5.87 (0.32)
BruteForce() 120.70 (2.16) 140.95 (1.71) 32.50 (0.47) 45.90 (0.79)

Pre-hashed In Order

Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.

Debug Release
Win32 x64 Win32 x64
SwitchValue() 12.49 (0.61) 13.23 (0.37) 1.94 (0.17) 1.81 (0.12)
SwitchValueValidate() 17.08 (1.06) 16.72 (0.57) 4.32 (0.30) 4.05 (0.21)
SwitchValueMinimized() 11.83 (0.69) 12.06 (0.51) 1.29 (0.13) 1.58 (0.17)
SwitchValueMinimizedValidate() 16.02 (0.84) 15.84 (0.66) 3.25 (0.24) 3.47 (0.27)
g_SwitchValueMinimizedArray 1.23 (0.06) 1.15 (0.10) 0.00 (0.00) 0.00 (0.00)
g_SwitchValueMinimizedArrayValidate 4.21 (0.32) 2.99 (0.20) 2.45 (0.17) 2.66 (0.20)

Pre-hashed Shuffled

Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.

Debug Release
Win32 x64 Win32 x64
SwitchValue() 12.96 (1.37) 13.45 (0.47) 1.84 (0.11) 1.81 (0.16)
SwitchValueValidate() 16.27 (2.01) 16.57 (0.63) 2.65 (0.19) 2.85 (0.17)
SwitchValueMinimized() 11.75 (0.63) 12.15 (0.45) 1.07 (0.07) 1.06 (0.11)
SwitchValueMinimizedValidate() 16.44 (0.99) 16.33 (0.58) 3.43 (0.18) 3.41 (0.22)
g_SwitchValueMinimizedArray 1.13 (0.06) 1.18 (0.10) 0.32 (0.05) 0.31 (0.04)
g_SwitchValueMinimizedArrayValidate 4.50 (0.32) 3.31 (0.18) 2.82 (0.16) 3.29 (0.18)

Observations

There’s a lot of data, but here’s the things I found most interesting or relevant to what I’m looking at (generic data structures vs ad hoc code for data).

Tests 1 and 2

std::map and std::unordered map are very, very slow in debug as you might expect. It would be interesting to look deeper and see what it is that they are doing in debug to slow them down so much.

There is some tribal knowledge in the C++ world that says to not use std::map and to use std::unordered_map instead, but I was surprised to see just how slow std::map was. in x64 release, std::map took about 75% the time that brute force did, and in win32 release, it took the same time or was slower! std::map isn’t hash based, you give it a comparison function that returns -1,0, or 1 meaning less than, equal or greater than. Even so, you have to wonder how the heck the algorithm can be so slow that brute force is a comparable replacement for lookup times!

It’s interesting to see that everything i tried (except brute force) was significantly faster than both std::map and std::unordered_map. That saddens me a little bit, but to be fair, the usage case I’m going after is a static data structure that has fast lookup speeds, which isn’t what unordered_map aims to solve. This just goes to show that yes, if you have static data that you want fast lookup times for, making ad hoc code or rolling your own read only data structure can give you significant wins to performance, and also can help memory issues (fragmentation and wasted allocations that will never be used).

It was surprising to see that switching on the first letter and brute forcing the strings with the same first letter did so well. That is one of the faster results, competing with SwitchValue() for top dog. The interesting thing though is that BruteForceByStartingLetter() gracefully handles invalid input, while SwitchValue() does not and has undefined behavior, so another point goes to BruteForceByStartingLetter().

Tests 3 and 4

These tests were done with pre-hashed keys to simulate an ideal setup.

If you have a static key to value data structure and have the ability to make ad hoc code for your specific static data, chances are pretty good that you’ll also be able to pre-hash whatever keys you are going to be looking up so you don’t have to hash them at run time. Also, if you are doing multiple lookups with a single key for some reason, you may opt to calculate the hash only on the first lookup, and then from there re-use the hashed key.

These tests simulated those situations.

As expected, the perf results on these tests are much better than those that hash the key on demand for each lookup. Less work done at runtime means better performance.

Based on the results of the last blog post – that array lookups are super fast – you probably aren’t surprised to see that g_SwitchValueMinimizedArray is the winner for performance by far.

It is so fast that the in order case doesn’t even register any time having been taken. This is probably a little misleading, because doing the in order tests (and even the shuffled tests) are very cache friendly. In reality, you probably would have more cache misses and it wouldn’t be quite as cheap as what is being reported, but would still be super fast compared to the other options.

In second place comes SwitchValueMinimized() which is the switch statement function version of g_SwitchValueMinimizedArray. Arrays still beat switch statements, as we found out in the last post!

In third place comes SwitchValue(), which is the same as SwitchValueMinimized() but has sparser values used in the switch statement, which make it more difficult for the compiler to generate efficient code. For instance, having the full range of 32 bits as case statement values, and having them all be pseudo random numbers (because they are the result of a hash!) rules out the ability for the compiler to make a jump table array, or find any patterns in the numbers. The SwitchValueMinimized() function on the other hand has only 337 possible values, and so even though the values are sparse (there are 100 items in those 337 possible values), it’s a small enough number that a jump table array could be used without issues.

After that comes all the validated versions of the tests. It makes sense that they would be slower, because they do all the same work, and then some additional work (strcmp) to ensure that the input is valid.

Getting The Fastest Results

If you have some static data that maps keys to values, and you need it to be fast for doing lookups, it looks like writing something custom is the way to go.

The absolutely fastest way to do it is to make an array out of your data items and then pre-process (or compile time process) any places that do a lookup, to convert keys to array indices. then, at run time, you only need to do an array lookup to a known index to get your data, which is super fast. If your data has duplicates, you might also be able to make the keys which point at duplicate data instead just point at the same array index, to de-duplicate your data.

If doing that is too complex, or too much work, a low tech and low effort way to handle the problem seems to be to break your data up into buckets, possibly based on their first letter, and then doing brute force (or something else) to do the lookup among the fewer number of items.

In fact, that second method is sort of like a hard coded trie which is only two levels deep.

If you needed to do some hashing at runtime, finding a faster hash function (that also worked in constexpr, or at least had a constexpr version!) could help you get closer to the pre-hashed keys timings. The good news is the hash doesn’t have to be particularly good. It just has to be fast and have no collisions for the input values you wish to use. That seems like something where brute force searching simple hash functions with various salt values may give you the ideal solution, but probably would take a very, very long time to find what you are looking for. You might notice that the default hash used for std::unordered_map is actually faster than the crc32 implementation I used.

Of course, we also learned what NOT to do. Don’t use brute force, and don’t use std::map. Using std::unordered_map isn’t super aweful compared to those solutions, but you can do a lot better if you want to.

Why This?

This fast key to value lookup might sound like a contrived need to some people, but in game development (I am a game developer!), there is commonly the concept of a game database, where you look up data about things (how much damage does this unit do?) by looking up a structure based on a unique ID that is a string, named by a human. So, in game dev, which also has high performance needs, optimizing this usage case can be very helpful. There is a little bit more talk about game data needs here: Game Development Needs Data Pipeline Middleware.

Is Code Faster Than Data?

I still think ad hoc code for data structures can often be faster than generic data structures, and the experiments on this post have positive indications of that.

Another way I think ad hoc code could be helpful is when you have hierarchical and/or heterogeneous data structures. By that I mean data structures which have multiple levels, where each level may actually have different needs for how best to look up data in it, and in fact, even siblings on the same level maybe have different needs for how best to look up data in it.

In these cases, you could make some data types which had virtual functions to handle the nature of the data needing different solutions at each level, but those virtual function calls and abstractions add up.

I think it’d be superior to have hard coded code that says “oh, you want index 4 of the root array? ok, that means you are going to binary search this list next”. Of course, that code needs to be generated by another program to be effective. If a human has to make sure all that code stays up to date, it’ll be a ton of work, and it’ll be broken, making very subtle hard to reproduce bugs.

A downside I can see to ad hoc code solutions is possibly thrashing the instruction cache more. Not sure if that’d be an issue in practice, it’d be interesting to try more complex data structures and see how it goes.

Also, it might be difficult to have good heuristics to figure out what is best in which situations. I could see a utility possibly generating different variations of code and running them to see which was most performant. Seems like it’d be a challenge to get 100% right all the time, but our experiments make it seems like it’d be easy to do significantly better than generic algorithms which are meant to be dynamic at runtime.

I also think that more complex data structures are more likely to get benefit of having custom code made for them. Simple ones less likely so. It’s hard to beat an array lookup. That being said, the unbeatable simple data structures make great building blocks for the more complex ones (;

It probably would also be good to look into memory usage a bit more to see how ad hoc code compares to generic algorithms. If ad hoc code is much faster but uses more memory, that’d have to be a conscious decision to make when weighing the trade offs.

Maybe in the future, the C++ standards will allow for static data structure types that you have to initialize with compile time constants (allowing constexpr), that are optimized for lookup times since they can’t have any inserts or deletes? I wonder how much demand there would be for that?

Here’s a good read on some other string friendly data structures:
Data Structures for Strings

Twitter user @ores brought up two interesting points:

  1. It would be interesting to see gperf performs in this situation. If makes a faster minimal perfect hash function, it’ll get us closer to the pre-hashed keys timings.
  2. It would be interesting to time scripting languages to see if for them code is faster than data or not. Another interesting aspect of this would be to look at a JIT compiled scripting language like lua-jit. The thing that makes JIT interesting is that it can compile for your specific CPU, instead of having to compile for a more generic set of CPU features. That gives it the opportunity to make code which will perform better on your specific machine.

Who Cares About Dynamic Array Growth Strategies?

Let’s say that you’ve allocated an array of 20 integers and have used them all. Now it’s time to allocate more, but you aren’t quite sure how many integers you will need in the end. What do you do?

Realloc is probably what you think of first for solving this problem, but let’s ignore that option for the moment. (If you haven’t used realloc before, give this a read! Alloca and Realloc – Useful Tools, Not Ancient Relics)

Without realloc you are left with allocating a new buffer of memory, copying the old buffer to the new buffer, and then freeing the old buffer.

The question remains though, how much memory should you allocate for this new, larger buffer?

You could double your current buffer size whenever you ran out of space. This would mean that as the buffer grew over time, you would do fewer allocations but would have more wasted (allocated but unused) memory.

You could also go the other way and just add 10 more int’s every time you ran out of space. This would mean that you would do a larger number of allocations (more CPU usage, possibly more fragmentation), but you’d end up with less wasted space.

Either way, it obviously depends entirely on usage patterns and it’s all subjective and situational.

…Or is it?

A Surprising Reason For Caring

Believe it or not, growth strategies can make a huge difference. Below we will explore the difference between the seemingly arbitrary rules of making a buffer twice as big, or 1.5 times as big.

Let’s say that we have a bunch of free memory starting at address 0. Let’s analyze what happens as we resize arrays in each scenario.

2x Buffer Size

First let’s see what happens when we double a buffer’s size when it gets full.

We start by allocating 16 bytes. The allocator gives us address 0 for our pointer.

When the buffer gets full, we allocate 32 bytes (at address 16), copy the 16 bytes into it and then free our first 16 byte buffer.

When that buffer gets full, we allocate 64 bytes (at address 48), copy the 32 bytes into it and then free our 32 byte buffer.

Lastly, that buffer gets full, so we allocate 128 bytes (at address 112), copy the 64 bytes into it and then free our 64 byte buffer.

As you can see, doubling the buffer size causes our pointer to keep moving further down in address space, and a free piece of memory before it will never be large enough to hold a future allocation.

1.5x Buffer Size

Let’s see what happens when we make a buffer 1.5x as large when it gets full.

We start by allocating 16 bytes. The allocator gives us address 0 for our pointer.

When the buffer gets full, we allocate 24 bytes (at address 16), copy the 16 bytes into it and then free our first 16 byte buffer.

When that buffer gets full, we allocate 36 bytes (at address 40), copy the 24 bytes into it and free the 24 byte buffer.

When that buffer gets full, we allocate 54 bytes (at address 76), copy the 36 bytes into it and free the 36 byte buffer.

When that buffer gets full, we allocate 81 bytes (at address 130), copy the 54 bytes into it and free the 54 byte buffer.

Lastly, when that buffer gets full, we need to allocate 122 bytes (we rounded it up). In this case, there is 130 bytes of unused memory starting at address 0, so we can just allocate 122 of those bytes, copy our 81 bytes into it and free the 81 byte buffer.

Our allocations have folded back into themselves. Our pattern of resizing hasn’t created an ever moving / ever growing memory fragmentation monster, unlike the buffer size doubling, which has!

Small Print

The above does decrease memory fragmentation, by encouraging an allocation to tend to stay in one spot in memory, but it comes at a cost. That cost is that since it’s allocating less extra memory when it runs out, that you will end up having to do more allocations to reach the same level of memory usage.

That might be a benefit though, depending on your specific needs. Another way of looking at that is that you will end up with fewer bytes of wasted memory. By wasted memory I mean allocated bytes which are not actually used to store anything.

Realloc Makes This Moot Right?

You may be thinking “well if I use realloc, I don’t need to care about this right?”

That isn’t exactly true. If realloc is unable to give you more memory at the current pointer location, it will allocate a new buffer, copy the old data to the new buffer, free the old buffer, and return you the pointer to the new buffer. This is exactly the case that happens when you don’t use realloc.

Using the above growth strategy with realloc makes realloc work even better. It’s a good thing!

Caveat: exotic allocator behavior may not actually benefit from using this strategy with realloc, so have a look for yourself if you are in doubt!

Links

Here’s a discussion on the topic:
What is the ideal growth rate for a dynamically allocated array?

From the link above, apparently the ideal factor to use when upsizing a buffer in general (when worrying about fragmentation like this), is the golden ratio 1.618. Weird, huh?

Have other information about why various growth factors matter? Leave a comment or drop me a line (:

Thanks to Tom for mentioning this concept. Pretty interesting and surprising IMO.

Shamir’s Quest: Collect Any 3 Keys To Unlock The Secret!

This post is on something called Shamir’s Secret Sharing. It’s a technique where you can break a secret number up into M different pieces, where if you have any N of those M pieces, you are able to figure out the secret.

Thinking of it in video game terms, imagine there are 10 keys hidden in a level, but you can escape the level whenever you find any 7 of them. This is what Shamir’s Secret Sharing enables you to set up cryptographically.

Interestingly in this case, the term sharing in “secret sharing” doesn’t mean sharing the secret with others. It means breaking the secret up into pieces, or SHARES. Secret sharing means that you make shares out of a secret, such that if you have enough of the shares, you can recover the secret.

How Do You Share (Split) The Secret?

The basic idea of how it works is actually really simple. This is good for us trying to learn the technique, but also good to show it’s security since there are so few moving parts.

It relies on something called the Unisolvence Theorem which is a fancy label meaning these things:

  • If you have a linear equation, it takes two (x,y) points to uniquely identify that line. No matter how you write a linear equation, if it passes through those same two points, it’s mathematically equivelant.
  • If you have a quadratic equation, it takes three (x,y) points to uniquely identify that quadratic curve. Again, no matter how you write a quadratic equation, if it passes through those same three points, it’s mathematically equivalent.
  • The pattern continues for equations of any degree. Cubic equations require four points to be uniquely identified, Quartic equations require five points, and so on.

At a high level, how this technique works is that the number of shares (keys) you want someone to collect (N ) defines the degree of an equation.

You use random numbers as the coefficients of the powers of x in that equation, but use your secret number as the constant term.

You then create M data points of the form (x,y) aka (x,f(x)) . Those are your shares. You then give individual shares to people, or go hide them in your dungeon or do whatever you are going to do with them.

As soon as any one person has N of those M shares (data points), they will be able to figure out the equation of the curve and thus get the secret.

The secret number is the constant term of the polynomial, which is also just f(0) .

This image below from wikipedia is great for seeing how you may have two points of a cubic curve, but without a third point you can’t be sure what the quadratic equation is. In fact, there are an infinite number of quadratic curves that pass through any two points! Because of that, it takes the full number of required shares for you to be able to unlock the secret.

Example: Sharing (Splitting) The Secret

First you decide how many shares you want it to take to unlock the secret. This determines the degree of your equation.

Let’s say you wanted a person to have to have four shares to unlock the secret. This means our equation will be a cubic equation, since it takes four points to uniquely define a cubic equation.

Our equation is:

f(x) = R_1x^3 + R_2x^2 + R_3x + S

Where the R_i values are random numbers, and S is the secret value.

Let’s say that our secret value is 435, and that we picked some random numbers for the equation, making the below:

f(x) = 28x^3 + 64x^2 + 9x + 435

We now have a function that is uniquely identifiable by any 4 points of data on it’s curve.

Next we decide how many pieces we are going to create total. We need at least 4 so that it is in fact solvable. Let’s make 6 shares.

To do this, you just plug in 6 different values of x and pair each x value with it’s y value. Let’s do that:

\begin{array}{c|c} x & f(x) \\ \hline 1 & 536 \\ 2 & 933 \\ 3 & 1794 \\ 4 & 3287 \\ 5 & 5580 \\ 6 & 8841 \\ \end{array}

When doing this part, remember that the secret number is f(0) , so make sure and not share what the value of the function is when x is 0!

You could then distribute the shares (data pairs) as you saw fit. Maybe some people are more important, so you give them more than one share, requiring a smaller amount of cooperation with them to unlock the secret.

Share distribution details are totally up to you, but we now have our shares, whereby if you have any of the 4 of the 6 total shares, you can unlock the secret.

How Do You Join The Secret?

Once you have the right number of shares and you know the degree of the polynomial (pre-shared “public” information), unlocking the secret is a pretty straightforward process too. To unlock the secret, you just need to use ANY method available for creating an equation of the correct degree from a set of data points.

This can be one of several different interpolation techniques, but the most common one to use seems to be Lagrange interpolation, which is something I previously wrote up that you can read about here: Lagrange Interpolation.

Once you have the equation, you can either evaluate f(0) , or you can write the equation in polynomial form and the constant term will be the secret value.

Example: Joining the Secret

Let’s say that we have these four shares and are ready to get the cubic function and then unlock the secret number:

\begin{array}{c|c} x & y \\ \hline 1 & 536 \\ 2 & 933 \\ 4 & 3287 \\ 6 & 8841 \\ \end{array}

We could bust out some Lagrange interpolation and figure this out, but let’s be lazy… err efficient I mean. Wolfram alpha can do this for us!

Wolfram Alpha: cubic fit (1, 536), (2, 933), (4, 3287), (6, 8841)

That gives us this equation, saying that it is a perfect fit (which it is!)
28x^3 + 64x^2 + 9x + 435

You can see that our constant term (and f(0) ) is the correct secret value of 435.

Daaaayummm Bru… that is lit AF! We just got hacked by wolfram alpha 😛

A Small Complication

Unfortunately, the above has a weakness. The weakness is that each share you get gives you a little bit more information about the secret value. You can read more about this in the links section at the end if you want to know more details.

Ideally, you wouldn’t have any information about the secret value until you had the full number of shares required to unlock the secret.

To address this problem, we are going to choose some prime number k and instead of shares being (x,y) data points on the curve, they are going to be (x,y \bmod k) . In technical terms we are going to be using points on a finite field, or a Galois field.

The value we choose for k needs to be larger than any of the coefficients of our terms (the random numbers) as well as larger than our secret value and larger than the number of shares we want to create. The larger the better besides that, because a larger k value means a larger “brute force” space to search.

If you want to use this technique in a situation which has real needs for security, please make sure and read more on this technique from more authoritative sources. I’m glossing over the details of security quite a bit, and just trying to give an intuitive understanding of this technique (:

Source Code

Below is some sample source code that implements Shamir’s Secret Sharing in C++.

I use 64 bit integers, but if you were going to be using this in a realistic situation you could very well overflow 64 bit ints and get the wrong answers. I hit this problem for instance when trying to require more than about 10 shares, using a prime of 257, and generating 50 shares. If you hit the limit of 64 bit ints you can use a multi precision math library instead to have virtually unlimited sized ints. The boost multiprecision header library is a decent choice for multi precision integers, specifically cpp_int.

#include <stdio.h>
#include <array>
#include <vector>
#include <math.h>
#include <random>
#include <assert.h>
#include <stdint.h>
#include <inttypes.h>

typedef int64_t TINT;
typedef std::array<TINT, 2> TShare;
typedef std::vector<TShare> TShares;

class CShamirSecretSharing
{
public:
    CShamirSecretSharing (size_t sharesNeeded, TINT prime)
        : c_sharesNeeded(sharesNeeded), c_prime(prime)
    {
        // There needs to be at least 1 share needed
        assert(sharesNeeded > 0);
    }

    // Generate N shares for a secretNumber
    TShares GenerateShares (TINT secretNumber, TINT numShares) const
    {
        // calculate our curve coefficients
        std::vector<TINT> coefficients;
        {
            // store the secret number as the first coefficient;
            coefficients.resize((size_t)c_sharesNeeded);
            coefficients[0] = secretNumber;

            // randomize the rest of the coefficients
            std::array<int, std::mt19937::state_size> seed_data;
            std::random_device r;
            std::generate_n(seed_data.data(), seed_data.size(), std::ref(r));
            std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
            std::mt19937 gen(seq);
            std::uniform_int_distribution<TINT> dis(1, c_prime - 1);
            for (TINT i = 1; i < c_sharesNeeded; ++i)
                coefficients[(size_t)i] = dis(gen);
        }

        // generate the shares
        TShares shares;
        shares.resize((size_t)numShares);
        for (size_t i = 0; i < numShares; ++i)
            shares[i] = GenerateShare(i + 1, coefficients);
        return shares;
    }

    // use lagrange polynomials to find f(0) of the curve, which is the secret number
    TINT JoinShares (const TShares& shares) const
    {
        // make sure there is at elast the minimum number of shares
        assert(shares.size() >= size_t(c_sharesNeeded));

        // Sigma summation loop
        TINT sum = 0;
        for (TINT j = 0; j < c_sharesNeeded; ++j)
        {
            TINT y_j = shares[(size_t)j][1];

            TINT numerator = 1;
            TINT denominator = 1;

            // Pi product loop
            for (TINT m = 0; m < c_sharesNeeded; ++m)
            {
                if (m == j)
                    continue;

                numerator = (numerator * shares[(size_t)m][0]) % c_prime;
                denominator = (denominator * (shares[(size_t)m][0] - shares[(size_t)j][0])) % c_prime;
            }

            sum = (c_prime + sum + y_j * numerator * modInverse(denominator, c_prime)) % c_prime;
        }
        return sum;
    }

    const TINT GetPrime () const { return c_prime; }
    const TINT GetSharesNeeded () const { return c_sharesNeeded; }

private:

    // Generate a single share in the form of (x, f(x))
    TShare GenerateShare (TINT x, const std::vector<TINT>& coefficients) const
    {
        TINT xpow = x;
        TINT y = coefficients[0];
        for (TINT i = 1; i < c_sharesNeeded; ++i) {
            y += coefficients[(size_t)i] * xpow;
            xpow *= x;
        }
        return{ x, y % c_prime };
    }

    // Gives the decomposition of the gcd of a and b.  Returns [x,y,z] such that x = gcd(a,b) and y*a + z*b = x
    static const std::array<TINT, 3> gcdD (TINT a, TINT b) {
        if (b == 0)
            return{ a, 1, 0 };

        const TINT n = a / b;
        const TINT c = a % b;
        const std::array<TINT, 3> r = gcdD(b, c);

        return{ r[0], r[2], r[1] - r[2] * n };
    }

    // Gives the multiplicative inverse of k mod prime.  In other words (k * modInverse(k)) % prime = 1 for all prime > k >= 1 
    static TINT modInverse (TINT k, TINT prime) {
        k = k % prime;
        TINT r = (k < 0) ? -gcdD(prime, -k)[2] : gcdD(prime, k)[2];
        return (prime + r) % prime;
    }

private:
    
    // Publically known information
    const TINT          c_prime;
    const TINT          c_sharesNeeded;
};

void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}

int main (int argc, char **argv)
{
    // Parameters
    const TINT c_secretNumber = 435;
    const TINT c_sharesNeeded = 7;
    const TINT c_sharesMade = 50;
    const TINT c_prime = 439;   // must be a prime number larger than the other three numbers above

    // set up a secret sharing object with the public information
    CShamirSecretSharing secretSharer(c_sharesNeeded, c_prime);

    // split a secret value into multiple shares
    TShares shares = secretSharer.GenerateShares(c_secretNumber, c_sharesMade);

    // shuffle the shares, so it's random which ones are used to join
    std::array<int, std::mt19937::state_size> seed_data;
    std::random_device r;
    std::generate_n(seed_data.data(), seed_data.size(), std::ref(r));
    std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
    std::mt19937 gen(seq);
    std::shuffle(shares.begin(), shares.end(), gen);

    // join the shares
    TINT joinedSecret = secretSharer.JoinShares(shares);

    // show the public information and the secrets being joined
    printf("%" PRId64 " shares needed, %i shares maden", secretSharer.GetSharesNeeded(), shares.size());
    printf("Prime = %" PRId64 "nn", secretSharer.GetPrime());
    for (TINT i = 0, c = secretSharer.GetSharesNeeded(); i < c; ++i)
        printf("Share %" PRId64 " = (%" PRId64 ", %" PRId64 ")n", i+1, shares[i][0], shares[i][1]);

    // show the result
    printf("nJoined Secret = %" PRId64 "nActual Secret = %" PRId64 "nn", joinedSecret, c_secretNumber);
    assert(joinedSecret == c_secretNumber);
    WaitForEnter();
    return 0;
}

Example Output

Here is some example output of the program:

Links

Wikipedia: Shamir’s Secret Sharing (Note: for some reason the example javascript implementation here only worked for odd numbered keys required)
Wikipedia: Finite Field
Cryptography.wikia.com: Shamir’s Secret Sharing
Java Implementation of Shamir’s Secret Sharing (Note: I don’t think this implementation is correct, and neither is the one that someone posted to correct them!)

When writing this post I wondered if maybe you could use the coefficients of the other terms as secrets as well. These two links talk about the details of that:
Cryptography Stack Exchange: Why only one secret value with Shamir’s secret sharing?
Cryptography Stack Exchange: Coefficients in Shamir’s Secret Sharing Scheme

Now that you understand this, you are probably ready to start reading up on elliptic curve cryptography. Give this link below a read if you are interested in a gentle introduction on that!
A (Relatively Easy To Understand) Primer on Elliptic Curve Cryptography

Turning a Truth Table Into A digital Circuit (ANF)

In this post I’m going to show how you turn a truth table into a digital logic circuit that uses XOR and AND gates.

My Usage Case

My specific usage case for this is in my investigations into homomorphic encryption, which as you may recall is able to perform computation on encrypted data. This lets encrypted data be operated on by an untrusted source, given back to you, and then you can decrypt your data to get a result.

Lots of use cases if this can ever get fast enough to become practical, such as doing cloud computing with private data. However, when doing homomorphic encryption (at least currently, for the techniques I’m using), you only have XOR and AND logic operations.

So, I’m using the information in this post to be able to turn a lookup table, or a specific boolean function, into a logic circuit that I can feed into a homomorphic encryption based digital circuit.

Essentially I want to figure out how to do a homomorphic table lookup to try and make some simple as possible circuits, that will in turn be as fast and lean as possible.

If you want to know more about homomorphic encryption, here’s a post I wrote which explains a very simple algorithm: Super Simple Symmetric Leveled Homomorphic Encryption Implementation

Algebraic Normal Form

Algebraic normal form (ANF) is a way of writing a boolean function using only XOR and AND.

Since it’s a normal form, two functions that do the same thing will be the same thing in ANF.

There are other forms for writing boolean logic, but ANF suits me best for my homomorphic encryption circuit needs!

An example of boolean logic in ANF is the below:

f(x_1, x_2, x_3, x_4) = x_1 x_2 \oplus x_1 x_3 \oplus x_1 x_4

It is essentially a boolean polynomial, where AND is like multiplication, and XOR is like addition. It even factors the same way. In fact, ANF is not always the smallest circuit possible, you’d have to factor common ANDs to find the smallest way you could represent the circuit, like the below:

f(x_1, x_2, x_3, x_4) = x_1 (x_2 \oplus x_3 \oplus x_4)

That smaller form does 1 AND and 2 XORs, versus the ANF which does 3 ANDs and 2 XORs. In homomorphic encryption, since AND is so much more costly than XOR, minimizing the ANDs is a very nice win, and worth the effort.

Wikipedia has some more info about ANF here: Wikipedia: Algebraic normal form

Truth Tables and Lookup Tables

A truth table is just where you specify the inputs into a boolean function and the output of that boolean function for the given input:

\begin{array}{c|c|c|c} x_1 & x_2 & x_3 & f(x_1, x_2, x_3) \\ \hline 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ \end{array}

A lookup table is similar in functionality, except that it has multi bit output. When dealing with digital circuits, you can make a lookup table by making a truth table per output bit. For instance, the above truth table might just be the low bit of the lookup table below, which is just a truth table for addition of the input bits.

\begin{array}{c|c|c|c} x_1 & x_2 & x_3 & f(x_1, x_2, x_3) \\ \hline 0 & 0 & 0 & 00 \\ 0 & 0 & 1 & 01 \\ 0 & 1 & 0 & 01 \\ 0 & 1 & 1 & 10 \\ 1 & 0 & 0 & 01 \\ 1 & 0 & 1 & 10 \\ 1 & 1 & 0 & 10 \\ 1 & 1 & 1 & 11 \\ \end{array}

Converting Truth Table to ANF

When I first saw the explanation for converting a truth table to ANF, it looked pretty complicated, but luckily it turns out to be pretty easy.

The basic idea is that you make a term for each possible combination of x inputs, ANDing a term by each constant, and then solving for those constants.

Let’s use the truth table from the last section:

\begin{array}{c|c|c|c} x_1 & x_2 & x_3 & f(x_1, x_2, x_3) \\ \hline 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ \end{array}

For three inputs, the starting equation looks like this:

f(x_1, x_2, x_3) = \\ a_0 \\ \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3 \\ \oplus a_{12} x_1 x_2 \oplus a_{13} x_1 x_3 \oplus a_{23} x_2 x_3 \\ \oplus a_{123} x_1 x_2 x_3

Now we have to solve for the a values.

To solve for a_{123} , we just look in the truth table for function f(x_1, x_2, x_3) to see if we have an odd or even number of ones in the output of the function. If there is an even number, it is 0, else it is a 1.

Since we have an even number of ones, the value is 0, so our equation becomes this:

f(x_1, x_2, x_3) = \\ a_0 \\ \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3 \\ \oplus a_{12} x_1 x_2 \oplus a_{13} x_1 x_3 \oplus a_{23} x_2 x_3 \\ \oplus 0 \land x_1 x_2 x_3

Note that \land is the symbol for AND. I’m showing it explicitly because otherwise the equation looks weird, and a multiplication symbol isn’t correct.

Since 0 ANDed with anything else is 0, and also since n XOR 0 = n, that whole last term disappears, leaving us with this equation:

f(x_1, x_2, x_3) = \\ a_0 \\ \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3 \\ \oplus a_{12} x_1 x_2 \oplus a_{13} x_1 x_3 \oplus a_{23} x_2 x_3

Next up, to solve for a_{12} , we need to limit our truth table to f(x_1, x_2, 0) . That truth table is below, made from the original truth table, but throwing out any row where x_{3} is 1.

\begin{array}{c|c|c|c} x_1 & x_2 & x_3 & f(x_1, x_2, 0) \\ \hline 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ \end{array}

We again just look at whether there are an odd or even number of ones in the function output, and use that to set a_{12} appropriately. In this case, there are an even number, so we set it to 0, which makes that term disappear again. Our function is now down to this:

f(x_1, x_2, x_3) = \\ a_0 \\ \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3 \\ \oplus a_{13} x_1 x_3 \oplus a_{23} x_2 x_3

If we look at f(x_1,0,x_3) , we find that it also has an even number of ones, making a_{13} become 0 and making that term disappear.

Looking at f(0,x_2,x_3) , it also has an even number of ones, making a_{23} become 0 and making that term disappear as well.

That leaves us with this equation:

f(x_1, x_2, x_3) = \\ a_0 \\ \oplus a_1 x_1 \oplus a_2 x_2 \oplus a_3 x_3

To solve for a_1 , we look at the truth table for f(x_1,0,0) , which is below:

\begin{array}{c|c|c|c} x_1 & x_2 & x_3 & f(x_1, 0, 0) \\ \hline 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ \end{array}

There are an odd number of ones in the output, so a_1 becomes 1. Finally, we get to keep a term! The equation is below:

f(x_1, x_2, x_3) = \\ a_0 \\ \oplus 1 \land x_1 \oplus a_2 x_2 \oplus a_3 x_3

Since 1 AND n = n, we can drop the explicit 1 to become this:

f(x_1, x_2, x_3) = \\ a_0 \\ \oplus x_1 \oplus a_2 x_2 \oplus a_3 x_3

If you do the same process for a_2 and a_3 , you’ll find that they also have odd numbers of ones in the output so also become ones. That puts our equation at:

f(x_1, x_2, x_3) = \\ a_0 \\ \oplus x_1 \oplus x_2 \oplus x_3

Solving for a_0 , is just looking at whether there are an odd or even number of ones in the function f(0,0,0) which you can look up directly in the lookup table. It’s even, so a_0 becomes 0, which makes our full final equation into this:

f(x_1, x_2, x_3) = x_1 \oplus x_2 \oplus x_3

We are done! This truth table can be implemented with 3 XORs and 0 ANDs. A pretty efficient operation!

You can see this is true if you work it out with the truth table. Try it out and see!

\begin{array}{c|c|c|c} x_1 & x_2 & x_3 & f(x_1, x_2, x_3) \\ \hline 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ \end{array}

Sample Code

Here is some sample code that lets you define a lookup table by implementing an integer function, and it generates the ANF for each output bit for the truth table. It also verifies that the ANF gives the correct answer. It shows you how to use this to make various circuits: bit count, addition, multiplication, division and modulus.

#include <stdio.h>
#include <array>
#include <vector>

#define PRINT_TRUTHTABLES() 0
#define PRINT_NUMOPS() 1
#define PRINT_ANF() 1

void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}

template <size_t NUM_INPUT_BITS>
bool LookupEntryPassesMask (size_t entry, size_t mask)
{
    for (size_t i = 0; i < NUM_INPUT_BITS; ++i)
    {
        const size_t bitMask = 1 << i;
        const bool allowOnes = (mask & bitMask) != 0;
        const bool bitPassesMask = allowOnes || (entry & bitMask) == 0;
        if (!bitPassesMask)
            return false;
    }
    return true;
}

template <size_t NUM_INPUT_BITS>
bool ANFHasTerm (const std::array<size_t, 1 << NUM_INPUT_BITS> &lookupTable, size_t outputBitIndex, size_t termMask)
{
    const size_t c_inputValueCount = 1 << NUM_INPUT_BITS;

    int onesCount = 0;
    for (size_t i = 0; i < c_inputValueCount; ++i)
    {
        if (LookupEntryPassesMask<NUM_INPUT_BITS>(i, termMask) && ((lookupTable[i] >> outputBitIndex) & 1) != 0)
            onesCount++;
    }

    return (onesCount & 1) != 0;
}

template <size_t NUM_INPUT_BITS>
void MakeANFTruthTable (const std::array<size_t, 1 << NUM_INPUT_BITS> &lookupTable, std::array<size_t, 1 << NUM_INPUT_BITS> &reconstructedLookupTable, size_t outputBitIndex)
{
    const size_t c_inputValueCount = 1 << NUM_INPUT_BITS;
    printf("-----Output Bit %u-----rn", outputBitIndex);

    // print truth table if we should
    #if PRINT_TRUTHTABLES()
        for (size_t inputValue = 0; inputValue < c_inputValueCount; ++inputValue)
            printf("  [%u] = %urn", inputValue, ((lookupTable[inputValue] >> outputBitIndex) & 1) ? 1 : 0);
        printf("rn");
    #endif

    // find each ANF term
    std::vector<size_t> terms;
    for (size_t termMask = 0; termMask < c_inputValueCount; ++termMask)
    {
        if (ANFHasTerm<NUM_INPUT_BITS>(lookupTable, outputBitIndex, termMask))
            terms.push_back(termMask);
    }

    // print function params
    #if PRINT_ANF()
        printf("f(");
        for (size_t i = 0; i < NUM_INPUT_BITS; ++i)
        {
            if (i > 0)
                printf(",");
            printf("x%i",i+1);
        }
        printf(") = rn");
    #endif

    // print ANF and count XORs and ANDs
    size_t numXor = 0;
    size_t numAnd = 0;
    if (terms.size() == 0)
    {
        #if PRINT_ANF()
        printf("0rn");
        #endif
    }
    else
    {
        for (size_t termIndex = 0, termCount = terms.size(); termIndex < termCount; ++termIndex)
        {
            if (termIndex > 0) {
                #if PRINT_ANF()
                printf("XOR ");
                #endif
                ++numXor;
            }

            size_t term = terms[termIndex];
            if (term == 0)
            {
                #if PRINT_ANF()
                printf("1");
                #endif
            }
            else
            {
                bool firstProduct = true;
                for (size_t bitIndex = 0; bitIndex < NUM_INPUT_BITS; ++bitIndex)
                {
                    const size_t bitMask = 1 << bitIndex;
                    if ((term & bitMask) != 0)
                    {
                        #if PRINT_ANF()
                        printf("x%i ", bitIndex + 1);
                        #endif
                        if (firstProduct)
                            firstProduct = false;
                        else
                            ++numAnd;
                    }
                }
            }
            #if PRINT_ANF()
            printf("rn");
            #endif
        }
    }
    #if PRINT_ANF()
    printf("rn");
    #endif

    #if PRINT_NUMOPS()
    printf("%u XORs, %u ANDsrnrn", numXor, numAnd);
    #endif

    // reconstruct a bit of the reconstructedLookupTable for each entry to be able to verify correctness
    const size_t c_outputBitMask = 1 << outputBitIndex;
    for (size_t valueIndex = 0; valueIndex < c_inputValueCount; ++valueIndex)
    {
        bool xorSum = false;
        for (size_t termIndex = 0, termCount = terms.size(); termIndex < termCount; ++termIndex)
        {
            size_t term = terms[termIndex];
            if (term == 0)
            {
                xorSum = 1 ^ xorSum;
            }
            else
            {
                bool andProduct = true;
                for (size_t bitIndex = 0; bitIndex < NUM_INPUT_BITS; ++bitIndex)
                {
                    const size_t bitMask = 1 << bitIndex;
                    if ((term & bitMask) != 0)
                    {
                        if ((valueIndex & bitMask) == 0)
                            andProduct = false;
                    }
                }
                xorSum = andProduct ^ xorSum;
            }
        }
        if (xorSum)
            reconstructedLookupTable[valueIndex] |= c_outputBitMask;
    }
}

template <size_t NUM_INPUT_BITS, size_t NUM_OUTPUT_BITS, typename LAMBDA>
void MakeANFLookupTable (const LAMBDA& lambda)
{
    // make lookup table
    const size_t c_outputValueMask = (1 << NUM_OUTPUT_BITS) - 1;
    const size_t c_inputValueCount = 1 << NUM_INPUT_BITS;
    std::array<size_t, c_inputValueCount> lookupTable;
    for (size_t inputValue = 0; inputValue < c_inputValueCount; ++inputValue)
        lookupTable[inputValue] = lambda(inputValue, NUM_INPUT_BITS, NUM_OUTPUT_BITS) & c_outputValueMask;

    // make the anf for each truth table (each output bit of the lookup table)
    std::array<size_t, c_inputValueCount> reconstructedLookupTable;
    std::fill(reconstructedLookupTable.begin(), reconstructedLookupTable.end(), 0);
    for (size_t outputBitIndex = 0; outputBitIndex < NUM_OUTPUT_BITS; ++outputBitIndex)
        MakeANFTruthTable<NUM_INPUT_BITS>(lookupTable, reconstructedLookupTable, outputBitIndex);

    // verify that our anf expressions perfectly re-create the lookup table
    for (size_t inputValue = 0; inputValue < c_inputValueCount; ++inputValue)
    {
        if (lookupTable[inputValue] != reconstructedLookupTable[inputValue])
            printf("ERROR: expression / lookup mismatch for index %urn", inputValue);
    }
    printf("expression / lookup verification complete.rnrn");
}

size_t CountBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
{
    // Count how many bits there are
    int result = 0;
    while (inputValue)
    {
        if (inputValue & 1)
            result++;
        inputValue = inputValue >> 1;
    }
    return result;
}

size_t AddBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
{
    // break the input bits in half and add them
    const size_t bitsA = numInputBits / 2;
    const size_t mask = (1 << bitsA) - 1;

    size_t a = inputValue & mask;
    size_t b = inputValue >> bitsA;
    
    return a+b;
}

size_t MultiplyBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
{
    // break the input bits in half and add them
    const size_t bitsA = numInputBits / 2;
    const size_t mask = (1 << bitsA) - 1;

    size_t a = inputValue & mask;
    size_t b = inputValue >> bitsA;

    return a * b;
}

size_t DivideBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
{
    // break the input bits in half and add them
    const size_t bitsA = numInputBits / 2;
    const size_t mask = (1 << bitsA) - 1;

    size_t a = inputValue & mask;
    size_t b = inputValue >> bitsA;

    // workaround for divide by zero
    if (b == 0)
        return 0;

    return a / b;
}

size_t ModulusBits (size_t inputValue, size_t numInputBits, size_t numOutputBits)
{
    // break the input bits in half and add them
    const size_t bitsA = numInputBits / 2;
    const size_t mask = (1 << bitsA) - 1;

    size_t a = inputValue & mask;
    size_t b = inputValue >> bitsA;

    // workaround for divide by zero
    if (b == 0)
        return 0;

    return a % b;
}

int main (int argc, char **argv)
{
    //MakeANFLookupTable<3, 2>(CountBits);    // Output bits needs to be enough to store the number "input bits"
    //MakeANFLookupTable<4, 3>(AddBits);      // Output bits needs to be (InputBits / 2)+1
    //MakeANFLookupTable<4, 4>(MultiplyBits); // Output bits needs to be same as input bits
    //MakeANFLookupTable<4, 2>(DivideBits);   // Output bits needs to be half of input bits (rounded down)
    //MakeANFLookupTable<4, 2>(ModulusBits);  // Output bits needs to be half of input bits (rounded down)
    //MakeANFLookupTable<10, 5>(DivideBits);  // 5 bit vs 5 bit division is amazingly complex!
    MakeANFLookupTable<4, 2>(ModulusBits);  // Output bits needs to be half of input bits (rounded down)
    WaitForEnter();
    return 0;
}

Sample Code Runs

Here is the program output for a “bit count” circuit. It counts the number of bits that are 1, in the 3 bit input, and outputs the answer as 2 bit output. Note that the bit 0 output is the same functionality as the example we worked through by hand, and you can see that it comes up with the same answer.

Here is the program output for an adder circuit. It adds two 2 bit numbers, and outputs a 3 bit output.

Here is the program output for a multiplication circuit. It multiplies two 2 bit numbers, and outputs a 4 bit number.

Here is the program output for a division circuit. It divides a 2 bit number by another 2 bit number and outputs a 2 bit number. When higher bit counts are involved, the division circuit gets super complicated, it’s really crazy! 5 bit divided by 5 bit is several pages of output for instance. Note that it returns 0 whenever it would divide by 0.

Lastly, here is the program output for a modulus circuit. It divides a 2 bit number by another 2 bit number and outputs the remainder as a 2 bit number.

Closing and Links

While the above shows you how to turn a single bit truth table into ANF, extending this to a multi bit lookup table is super simple; you just do the same process for each output bit in the lookup table.

Here are a few links in case anything above is unclear, or you want more information.

Finding Boolean/Logical Expressions for truth tables in algebraic normal form(ANF)

Finding Boolean/Logical Expressions for truth tables

Using Wang Tiles to Simulate Turing Machines

Wang tiles were invented by Hao Wang in 1961 for mathematical reasons, but they find great use in games for making tile based art which gives results that don’t look tiled – both with 2d tiled textures, as well as 3d tiled models.

Apparently Wang tiles are also able to execute Turing machines, and so are thus Turing complete – meaning they can execute any program.

That is a pretty amazing and perplexing statement, so this post explores that a bit.

Wang Tiles Overview

Wang tiles are rectangular tiles where each edge will only fit with other specific edges, but that for any specific edge, there is more than one possible tile that can fit with that edge. By fit with that edge, I mean they are seamless when put together, without any visual artifacts to hint at there actually being a seam between the tiles.

This is useful for graphics because this lets you have seamless tiled graphics, but the specific configuration of how the tiles are placed can be completely randomized, so long as their edges are all compatible. The result is tiled graphics that doesn’t look at all tiled, due to visible patterns being much less noticeable than with traditional tiled graphics.

For graphical examples, a little more info and some links to some shadertoys check this out: Wang Tiling

Here is an example I made. The graphics are programmer art but hopefully you get the idea. This was made with 16 tiles, where there were two different edge types per edge.

Turing Machine Overview

Turing machines were invented in 1936 by Alan Turing as a generic computing machine that was proven to be able to execute any algorithm.

The turing machine is made up of a few main components: the memory tape, the read/write head, and the state machine.

The memory tape is infinitely long, so has infinite storage, and is initialized to all zeroes to start out.

The read/write head starts at a position on the tape, and can read or write values, and also move left or right on the tape.

The state machine controls the read/write head.

The state machine knows what state it is in and has rules about what to do in each state when it reads a value from the tape.

For instance, in state A, if a 0 is read from the tape, the rule may be to write a 1 to the current position on the tape, move the read/write head to the right, and go to state B. State B may have completely different logic, and could either transition back to state A, state in state B, or move to another state entirely.

Using simple state transition logic like that, any computer algorithm can be performed.

In a Turing machine there can also be a “Halt State” which means the program is finished executing and the answer it was trying to calculate has been calculated.

Looking at some programs, you can easily see that they will halt eventually, or that they will be an infinite loop and never halt. Some programs in-between are complex and can’t very easily be determined if they will ever halt or not. Turing proved that there is no general solution to whether a Turing machine (aka any computer program) will halt or not, and this is called the Halting Problem. In general, the only way to know if a program will halt or not is to wait and see. So, effectively the answers to whether a program in general will halt or not are “yes” and “not yet” – although for many specific programs you can in fact see that they will halt eventually if you were to run them.

Wang Tile Computation

It turns out that Wang tiles can simulate a Turing machine, and so are “Turing complete” meaning that they too can perform any computer algorithm.

To make this happen, we’ll make a column of tiles that represent the state of the Turing machine at a specific point in time, starting with time 0 at the left most column. We’ll place tiles in the column to the right making sure all edge rules are respected, and then do the column to the right of that one etc until the program halts (or forever if it doesn’t halt). If we set up our set of tiles correctly, the act of satisfying the edge rules as we place our tiles is enough to execute the Turing machine.

Let’s walk through a simple example where we have the following state machine logic rules:

  1. When in state A, if a 0 is read, we will write a 1, move the read/write head down and move to state B.
  2. When in state A, if a 1 is read, we will halt (enter the halt state).
  3. When in state B, if a 0 is read, we will write a 1, move the read/write head up and move to state A.
  4. When in state B, if a 1 is read, we will halt (enter the halt state).

Tape Memory Storage

The first thing we need is persistent storage of memory for the tape. For this, we’ll need the following two tiles:

To see this working, we can set up a section of tape with some values (make a column of wang tiles), and we can see that the only valid wang tiles to place next to the starting column are tiles which propogate the 0 and the 1 values forward in time without modifying them.

In the diagram below, we initialize the tape to 0101 in the left most column (time 0). By only placing down tiles with compatible edges you can see that our memory values persist forever. Our memory storage is implemented, huzzah!

We’ll start our example with all memory initialized to 0, but the above shows that we can have persistent memory.

Read/Write Head State Machine

The read/write head of the Turing machine is represented as part of the edge information. In this way, besides an edge storing the 0 or 1, if that is where the read/write head is, it also stores the state of the state machine.

Our example uses two states (besides the halt state): A and B. If a 1 is read in while being in either state A or B, the program halts.

To handle that, we need the tiles below:

Now that we have the rules for entering the halt state done (rule #2 and rule #4), we have to figure out how to implement the rules that control switching from one state to another (rule #1 and rule #3).

Moving the Read/Write Head

Rule #1 says that if we are in state A and read a 0, we should write a 1, move the read/write head down and move to state B.

We’ll need this tile to cause reading a 0 in state A to write a 1 as output, and to tell the tile below to move to state B.

The tile below that one could either be a 0 or a 1, and without knowing which, we want it to keep it’s value but accept the read/write head and be in state B. To do that we need two tiles, one for if there is a 0 on the tape at that position, and another for if there is a 1 on the tape.

Rule #3 says that if we are in state B and read a 0, we should write a 1, move the read/write head up and move to state A.

To do that, we’ll need a similar construction as for rule #1 but we are moving up instead of down. These 3 tiles will give us what we need:

Starting Column Tiles

We are going to treat the boundaries of our simulation area as if they have edges of “x”.

This means that to make our starting column (the Turing machine at time 0), we are going to need 2 special tiles. One tile will be for storing a 0 on the tape, which is what the tape is initialized to, and the other tile will be for storing the position of the read/write head in state A, which is our starting state.

Here are those two tiles:

Final Tileset

Here’s the full set of 12 tiles that we are going to use:

Full Simulation

Here is the initial setup at time 0 for our Turing machine. Note that this is one possible starting state, but this is the starting state we are choosing. We are not leaving it up to chance where the read/write head starts, or if it is even present at all. If we only followed edge rules we may get 4 read/write heads or 0, or anything in between.

From here, to build the second column, we start from the top and work towards the bottom, choosing the tile that fits the constraints of the edge it touches. In this first step, the head reads a 0, writes a 1, moves down, and moves to state B.

Heres is the second step, where the read reads a 0, writes a 1, moves up, and moves to state A.

Here is the final step, where the head reads a 1 and enters the halt state, signifying that the program has terminated.

The program halted, and gave an output value of “0110” or 6. This output isn’t really meaningful but other programs can give output that is meaningful. For instance you could have a Turing machine add two numbers, and the output would be the sum of those two numbers.

An Important Detail

There is an important detail that the above doesn’t address, and that many explanations of Wang tile Turing machines don’t seem to talk about.

When placing the second tile for time 2, the only constraint from the edges is that the tile must have an x on top and a 1 on the left. This actually makes it ambiguous which tile should be chosen between the two tiles below.

How do we choose the right one then?

The answer is that you make a guess and just choose one. If the wrong one was chosen in this case, when we moved to the next tile, we’d be looking for a tile which had an x on top and a B0 on the left. There is no such tile so you’d be unable to place a tile. When this happened, you’d take a step back to the last tile, and try one of the other possibilities.

So, unfortunately there is some literal trial and error involved when simulating Turing machines with Wang tiles, but it is fairly manageable at least. It definitely makes it a bit more complex to calculate in a pixel shader if you were so inclined (or other massively parallel processing units), but it shouldn’t be that much more costly.

Closing & Links

Some of the links below talk about Wang tiles and Turing machines, but don’t seem to strictly be Turing machines anymore. For instance, you might notice that some examples allow data to travel “back in time” where after the program halts, the answer is in the tape at time 0 of the Turing machine, even though that data wasn’t actually there at time 0. This shows that Wang tiles can do computation in their own right, beyond simulating Turing machines, but I’m not really sure what that technique itself would be called.

Also if you are wondering if this is useful to do computation with Wang tiles, I’m not really sure of any practical usage cases myself. However, apparently scientists have found that DNA can act much like Wang tiles act, where they will fit together only if edges are compatible. Because of this, there is ongoing research into DNA based computation that is based on the work of Wang tile computation. pretty interesting stuff!

Here is a shadertoy implementation of wang tiles computing prime numbers in a webgl pixel shader:
Shadertoy: WangTiles : PrimeGenerator

Here are some great videos on Turing machines and the halting problem:
Turing Machines Explained – Computerphile
Turing & The Halting Problem – Computerphile

Here are some other links:
Computing With Tiles
Wikipedia: Wang Tile
Wang Tiles and Turing Machines
Wang Tiles – 1

Here are some academic papers:
Computing With Tiles
Computability of Tilings

Matrix Form of Bezier Curves

Bezier curves are most often talked about either in terms of the De Casteljau algorithm, or in terms of a mathematical function (Bernstein Polynomials).

Every now and then though, you see people talking about Bezier curves being calculated via matrices. If you ever wondered what that was all about, this post should hopefully explain and demystify that a bit.

If you don’t know how to come up with the equation of a Bezier curve for any number of control points, you should give this a read first:
Easy Binomial Expansion & Bezier Curve Formulas

And if you are curious about the De Casteljau algorithm, you can learn about that here:
The De Casteljau Algorithm for Evaluating Bezier Curves

Ok, all read up on that stuff? Let’s get talking about Bezier curves in matrix form! There are shadertoy links at the end with working wegl glsl demos that include source code.

Making the Matrix Form of Bezier Curves

Coming up with the matrix for a Bezier curve is surprisingly easy. Keep in mind the matrix we are making is for glsl which is a column major matrix order, so you might have to adjust things if you are using a row major matrix order setup (mostly, just transpose the matrix).

The first step is to get the formula for a Bezier curve. We’ll work through the example using a quadratic Bezier curve with 3 control points A,B,C, so we start with the formula below:

f(t) = A*(1-t)^2 + B*2t(1-t) + C*t^2

The next step is to break the equation into one equation per term. Each term has a control point, so we are basically splitting the formula up so that we have one formula per control point.

A*(1-t)^2 \\ B*2t(1-t) \\ C*t^2

Next, we remove the control points and expand each term to get:

1-2t+t^2 \\ 2t-2t^2 \\ t^2

Now, explicitly values of all powers of t that are present:
1*t^0-2*t^1+1*t^2 \\ 0*t^0+2*t^1-2*t^2 \\ 0*t^0+0*t^1+1*t^2

Now the final step. Take the constants that multiply your powers of t and make a matrix out of them. You are done!

\begin{bmatrix} 1 & -2 & 1 \\ 0 & 2 & -2 \\ 0 & 0 & 1 \\ \end{bmatrix}

Using the Matrix Form

Using the matrix form of Bezier curves is also pretty simple.

First, we need to make a vector of the power series of our t value:

powerSeries = \begin{bmatrix} t^0 & t^1 & t^2 \\ \end{bmatrix}

Which can also be written as:

powerSeries = \begin{bmatrix} 1 & t & t^2 \\ \end{bmatrix}

You also need a vector of your control points:

controlPoints = \begin{bmatrix} A & B & C \\ \end{bmatrix}

You next perform this operation to get a result vector:

result = powerSeries * curveMatrix * controlPoints

Then, you add up all components of result to get the value of the curve at time t.

value = result[0] + result[1] + result[2]

All done!

Note that this is a one dimensional Bezier curve. You need to do this operation once per axis to get your final multi dimensional Bezier curve point.

If you are confused by that last line, check out this post: One Dimensional Bezier Curves

Multiplying the Control Points In

You might notice that if you are evaluating several points on the same curve that you are going to be multiplying the curveMatrix matrix by the controlPoints vector over and over. You can multiply the control points into the Bezier curve matrix to make the specific matrix for those control points if you want to. You multiply the columns of the matrix by the control points, and adjust the result calculation like the below.

// Multiply the control points into the curve matrix
curveMatrix[0] *= A;
curveMatrix[1] *= B;
curveMatrix[2] *= C;

// Use the curve matrix that has the control points baked in, to do less math to get the result vector.
// You would calculate the curve matrix once and re-use it multiple times of course!
vec3 result = powerSeries * curveMatrix;
float value = result.x + result.y + result.z;

Closing

You might wonder when you’d use the matrix form. One time to use the matrix form would be when you had fast matrix math support (like on the GPU). Another time to use the matrix form though is if you ever want to cut up a Bezier curve into multiple smaller sub curves. The matrix form can help make that easier, and you can read more about that here if you want: A Matrix Formulation of the Cubic Bezier Curve

Here are some shadertoys that show this all working in webgl/glsl pixel shaders, along with source code:

Shadertoy: 1D Linear Bezier Matrix Form
Shadertoy: 1D Quadratic Bezier Matrix Form
Shadertoy: 1D Cubic Bezier Matrix Form

Actually Making Signed Distance Field Textures With JFA

This post is an addendum to the last post where I say that you can make distance field textures with JFA but don’t fully explain how to make SIGNED distance field textures, which is what you really want.

If you want to go straight to a working demo with webgl pixel shader source code, here is the shadertoy: Shadertoy: JFA SDF Texture

If you naively use a distance transform to make a distance field texture, you’ll get an UNSIGNED distance field texture, where you only have the distance to the surface of the object from the outside, but won’t have the distance to the surface of the object from the inside.

This is important because signed distance field textures have both, and use bilinear interpolation of distance on each side of the shape surface to make a nice smooth line. Below is what happens when you try to use an unsigned distance field texture (aka the distance transform of the image, using JFA / Voronoi information), using the zero distance line as the surface of the object:

It looks ok (if not fairly pixelated), but you can really see it break down when you zoom in:

So you might say to yourself, maybe i need to keep the surface line at distance 0.5 instead of 0.0 so that there is distance information to interpolate? If you do that, the first thing you might notice is that the objects get fatter:

But it does look better when you zoom in, which is a plus:

The real issue is that you really just need the distance from each pixel to the surface of the object from both the inside and the outside. In our case, our Voronoi diagram we make with JFA only gives the distance from the outside. So what is the solution? At first I was thinking maybe you can get the gradient of this data at the point of each pixel and “push the zero line in” a little bit to give at least one pixel layer worth of inside data. However, a brilliant friend of mine came up with the actual solution: You invert your source data so empty space becomes seed, and seed becomes empty space, and you run JFA again to get the distance from the inside!

That actually works very well. It’s also very easy to combine them. You make a pixel shader that reads the data from the outside Voronoi diagram and the inside Voronoi diagram, calculate the output distance (0.5 + outsideDistance * 0.5 – insideDistance * 0.5), and output that 0 to 1 distance value in one or more of the color channels.

Here’s a glsl excerpt below, note that we divide the distance by 8 and clamp between 0 and 1 so that the data is suitable for a normalized color image (normalized as in the color channels can store values between 0 and 1):

// calculate distances from seed coordinates
float outsideDist = clamp(length(outsideSeedCoord-fragCoord) / 8.0, 0.0, 1.0);
float insideDist  = clamp(length(insideSeedCoord-fFragCoord)  / 8.0, 0.0, 1.0);
    
// calculate output distance
float signedDistance = 0.5 + outsideDist * 0.5 - insideDist * 0.5;
        
// set the color based on that distance
fragColor = vec4(signedDistance);

It actually looks a lot like the first image where we use the zero distance line of the unsigned distance field texture, so we still aren’t quite there:

When you zoom in, it looks a little better, but something still seems a bit off:

The final step to making this look good is to realize that the power of signed distance textures is in their ability to interpolate distance information well. When we have a full resolution texture, there is no interpolation going on. We actually need to decrease the size of our distance field texture to make it look better. If only all problems were solved by making textures smaller!

Here is the resulting image when making the distance field texture 1/4 as large on each axis (1/16th as big total):

And zooming in you can see that it scales very well. The zoom is a 20x magnification, on top of the magnification we already get from it being a larger texture:

And just to show the intermediary textures, here is the outside distance Voronoi diagram:

And the inside distance Voronoi diagram (The seed is in bright green, the dim green is the empty space that has distance information):

And here is the final distance field texture used to render the final result I showed above.

Zoomed in to show just how low resolution it is! This is the thing that looks like a + or a sword just left of middle.

Again, here is the shadertoy that does this technique, generating a signed distance field texture on the fly for randomly placed objects, and then using that signed distance field to render a larger image that you can further zoom in to:
Shadertoy: JFA SDF Texture