External C++ Header Guards

A buddy at work said “I wish C++ had a two-pass preprocessor so that we could do external header guards”. It got me thinking about some random macro stuff I had seen before, and I thought “hrm… you know, that actually might be possible to do… I’m going to give it a try”.

I ended up working something up tonight at home that’s semi-palatable. The way you use it is a little weird, but I think it satisfies the spirit of the challenge, and works as a proof of concept that you can do external header guards without having to type a bunch of stuff.

If you can think of a way to improve it, post a comment or something and let me know!!

Umm…External Header Guards? What are you Talking About??

Have you ever seen something like the below? It’s called a header guard:

#ifndef BLAH_H
#define BLAH_H

// code goes here

#endif // BLAH_H

You might also have seen this variation:

#pragma once

Without the header guards, if you include the header file twice, the compiler will complain that the classes etc. have already been defined. The guards make sure that doesn’t happen, by only including the contents of the file if it hasn’t already been included.

External header guards on the other hand would be guards in the place it’s included instead of in the header file itself. That is more typing (more work), but the benefit there is that the compiler doesn’t have to open the header file at all to see if it’s already been included, which could make for faster compile times in large projects.
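For comparison, a hand-written external header guard might look like the sketch below. To keep it to a single file, the first half stands in for the contents of a hypothetical blah.h (it is pasted inline instead of being included), and the names Blah and BlahValue are made up for illustration. Because BLAH_H is already defined by the time the external guard is reached, the #include line sits in a skipped preprocessor group and the file is never opened.

```cpp
#include <cassert>

// ---- stand-in for the contents of a hypothetical blah.h ----
// (an ordinary internal header guard)
#ifndef BLAH_H
#define BLAH_H
struct Blah { int value; };
#endif // BLAH_H

// ---- what the including file would write: an EXTERNAL guard ----
// BLAH_H is already defined, so the preprocessor skips this whole group
// and never has to open blah.h at all.
#ifndef BLAH_H
#include "blah.h"
#endif

// Blah was defined exactly once, so this compiles.
int BlahValue()
{
    Blah b = { 42 };
    assert(b.value == 42);
    return b.value;
}
```

The cost is visible right away: every place that includes the header has to repeat the three guard lines and has to know the header's guard symbol.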

Anyways, here’s the code:

Main.cpp

// include testone/blah.h which defines the class ProofIncluded__blah_h
// to prove that it was really included
#define FILESEQ (testone)(blah)
#include "Includer.h"

// try to include testone/blah.h again.  It won't get included again, and
// instead, ProofIncludeBlocked__blah_h will get typedef'd by includer.h
// to prove that the file did not get included.  Comment these lines out
// and you'll get a compiler error that ProofIncludeBlocked__blah_h is
// an undeclared identifier
#define FILESEQ (testone)(blah)
#include "Includer.h" 

int main(int argc, char **argv)
{
	ProofIncluded__blah_h a;
	ProofIncludeBlocked__blah_h b;
	return 0;
}

So it is a little weird… but to include a file, you define FILESEQ with a directory and filename (without .h on it), and then include “Includer.h”. Even though it’s weird to use, and doesn’t work for .inl files (and maybe other issues, some easily solved), it’s only one extra line of typing to do an external header guard, which is about as good as you can expect.

Ideally I wish the interface were like the below, but I haven’t been able to figure out how to make that work unfortunately.

IncludeFile_((testone)(blah))

Includer.h

//=====================================================================================================
// Rip off boost, hooray!!  boost_pp is really nice, you can just grab it from the boost bundle and
// start using it because it's just a bunch of includes.  you don't need to build or link with boost
// at all.  It's really nice.  http://www.boost.org/
//=====================================================================================================
# define BOOST_PP_EMPTY()
# define BOOST_PP_SEQ_ELEM(i, seq) BOOST_PP_SEQ_ELEM_I(i, seq)
# define BOOST_PP_SEQ_ELEM_I(i, seq) BOOST_PP_SEQ_ELEM_II((BOOST_PP_SEQ_ELEM_ ## i seq))
# define BOOST_PP_SEQ_ELEM_II(res) BOOST_PP_SEQ_ELEM_IV(BOOST_PP_SEQ_ELEM_III res)
# define BOOST_PP_SEQ_ELEM_III(x, _) x BOOST_PP_EMPTY()
# define BOOST_PP_SEQ_ELEM_IV(x) x

# define BOOST_PP_SEQ_ELEM_0(x) x, BOOST_PP_NIL
# define BOOST_PP_SEQ_ELEM_1(_) BOOST_PP_SEQ_ELEM_0
//=====================================================================================================
#define EB_COMBINETEXT(a, b) EB_COMBINETEXT_INTERNAL(a, b)
#define EB_COMBINETEXT_INTERNAL(a, b) a ## b
#define EB_TOTEXT(a) EB_TOTEXT_INTERNAL (a)
#define EB_TOTEXT_INTERNAL(a) #a
//=====================================================================================================

// extract the directory and file name
#define DIR BOOST_PP_SEQ_ELEM(0, FILESEQ)
#define FILE BOOST_PP_SEQ_ELEM(1, FILESEQ)

// create the full file name: <DIR>/<FILE>.h
#define THEFILENAME EB_TOTEXT(EB_COMBINETEXT(DIR, EB_COMBINETEXT(/, EB_COMBINETEXT(FILE, .h))))

#if !EB_COMBINETEXT(__, EB_COMBINETEXT(FILE, _h))
	//including file: YES
	//include the file
	#include THEFILENAME
#else
	//including file: NO
	//create a typedef called ProofIncludeBlocked__<FILE>_h to prove we didn't include the file
	typedef char EB_COMBINETEXT(ProofIncludeBlocked__, EB_COMBINETEXT(FILE, _h));
#endif

// clean up the things we created.  boost macros can stick around ::shrug::
#undef THEFILENAME
#undef DIR
#undef FILE

// defined by caller, but we're cleaning it up for convenience
#undef FILESEQ

I ripped some macros out of the boost preprocessor library (boost_pp) to help things out a little bit. In a nutshell what we are doing is this…

We test to see if the preprocessor value __<File>_h is not true (false, or undefined). If that is the case, we include the file <Directory>/<File>.h. Else, we define a typedef ProofIncludeBlocked__<File>_h to prove that we blocked the include from happening.

Blah.h

#define __blah_h 1
class ProofIncluded__blah_h {};

Blah.h defines __blah_h as 1 (true). It’s important that it uses the same naming convention as Includer.h does (__<File>_h), otherwise this setup won’t work. If you screw it up, you’ll get compile errors about multiply defined symbols.

This file also defines a class ProofIncluded__blah_h to prove that this file actually got included. Since a class can only be defined once per translation unit, it doubles as something that will complain if the file is included twice.

Issues

So, this is just a proof of concept and it has some issues, including…

  • Duplicate file names – if you have the same file name in different folders, this setup has issues. It might be able to be helped by including the directory name into the header guard preprocessor symbol.
  • Referencing the same file different ways – if you reference the same file in different ways because there are multiple ways to reach it in the include search paths, it won’t be able to tell that it’s the same file if you do the last fix. Maybe the real solution is to have another parameter defined specifying the header guard symbol? Don’t know…
  • Only supports .h files – It assumes a .h extension, but maybe another parameter could be the file extension to use so you could include .inl, .hpp etc.

Hopefully you find it interesting at least though (:

Is it Worth It?

Poday made some good points in the comments about it not being worth it, and my friend Doug also has this to say:

It’s not needed though all compilers already do that:

(GCC):
The GNU C preprocessor is programmed to notice when a header file uses this particular construct and handle it efficiently. If a header file is contained entirely in a ‘#ifndef’ conditional, then it records that fact. If a subsequent ‘#include’ specifies the same file, and the macro in the ‘#ifndef’ is already defined, then the file is entirely skipped, without even reading it.

(Clang)
The MultipleIncludeOpt class implements a really simple little state machine that is used to detect the standard “#ifndef XX / #define XX” idiom that people typically use to prevent multiple inclusion of headers. If a buffer uses this idiom and is subsequently #include‘d, the preprocessor can simply check to see whether the guarding condition is defined or not. If so, the preprocessor can completely ignore the include of the header.

(Clang still)
clang_isFileMultipleIncludeGuarded – Determine whether the given header is guarded against multiple inclusions, either with the conventional #ifndef/#define/#endif macro guards or with #pragma once.

For MSVC all I could find is Herb Sutter lead architect for MSVC and head of the C++ committee in his book ‘C++ Coding Standards: 101 Rules, Guidelines, and Best Practices’:
24. Always write internal #include guards. Never write external #include guards.
With a reason of:
Don’t try to be clever: Don’t put any code or comments before and after the guarded portion, and stick to the standard form as shown.
Today’s preprocessors can detect include guards, but they might have limited intelligence and expect the guard code to appear exactly at the beginning and end of the header.

Alloca and Realloc – Useful Tools, Not Ancient Relics

If you are a C/C++ programmer, you are likely familiar with malloc() and free(), the predecessors to C++’s new and delete operators, as well as the existence of the variations of malloc such as calloc, realloc and alloca.

If you are like me, you probably thought for a long while that malloc and its variations were relics of days gone by, only actually useful in a few very limited situations. Some of these guys still have use though, and don’t really have equivalents in C++ to replace them.

First the boring ones…
malloc – Allocates memory. Precursor to new operator.
calloc – Allocates memory and sets the contents to zero. C’s answer to the problem of uninitialized memory that constructors solve in C++.

Now the more interesting ones!

Alloca

Believe it or not, alloca actually allocates memory on the stack. When the function that called alloca returns, the memory is automatically given back due to the nature of how the stack and stack pointer work. No need to free the memory allocated with alloca, and in fact if you tried, you’d probably get a crash 😛

If you are a programmer who writes high performance applications, you are probably familiar with the benefits of using the stack instead of allocating memory on the heap with new or malloc.

The benefits of using the stack include…

  • Quicker allocations – Allocating memory can be a relatively costly operation in terms of time, especially if you have multiple threads running using the same (thread safe) allocator. Allocating memory on the stack is essentially the same cost as defining a local variable. Under the hood, it’s just moving the stack pointer a little farther and gives you that empty space to use.
  • No memory leaks – when the function you’ve allocated the stack memory in exits, that memory is automatically freed. This is because the stack pointer just “moves back” to what it used to be. There is not really any memory to free.
  • Less memory fragmentation – When mixing large and small memory allocations and frees, sometimes you end up with your memory in a state where there is a lot of memory free, but just not all together in one place. For instance, your program might need to allocate 50MB, and there may be 300MB free on the heap total, but if there are small 16 byte allocations littered in the memory every 10MB, your program won’t be able to find a single 50MB region to allocate and the allocation will fail. One common cause of this problem is small allocations used for things like relatively small arrays or small string buffer allocations that exist temporarily to copy or transform some data, but are not meant to stick around very long. If you can put these on the stack instead of the heap, those small allocations don’t hit the heap, and your memory will be less fragmented in the end.
  • Increased performance (fewer cache misses) – the contents of the stack are likely already in the CPU cache, so putting your data there means less information for the CPU to have to gather from RAM which is a slow operation.

However, there are some dangers when allocating memory on the stack as well…

  • If you keep a pointer to the memory, that memory could be “freed” and re-used, getting filled with other random data (local variables). That can cause crashes, memory corruption or other strange program behavior.
  • If you allocate too much on the stack you could run out of stack space. The stack isn’t really meant to hold large amounts of allocated data. You can adjust your program’s stack size though if this is a route you want to pursue.
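Here is a minimal sketch of what using alloca can look like, with a size cap to guard against the stack overflow danger above. Note that alloca is not part of standard C or C++: this assumes <alloca.h> on POSIX-like systems and _alloca from <malloc.h> on MSVC. The STACK_ALLOC macro, the CountVowels function and the 1024 byte cap are all made up for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <vector>

#if defined(_MSC_VER)
  #include <malloc.h>                        // _alloca on MSVC
  #define STACK_ALLOC(bytes) _alloca(bytes)
#else
  #include <alloca.h>                        // alloca on most POSIX-like systems
  #define STACK_ALLOC(bytes) alloca(bytes)
#endif

// arbitrary cap chosen for this sketch; anything larger goes to the heap
const std::size_t kMaxStackBytes = 1024;

// Counts the vowels in text by working on a temporary lowercase copy.
std::size_t CountVowels(const char* text)
{
    assert(text != nullptr);
    const std::size_t length = std::strlen(text);

    std::vector<char> heapScratch;           // only used when the copy is too big
    char* scratch = nullptr;
    if (length + 1 <= kMaxStackBytes)
    {
        // sized at run time, lives until this FUNCTION returns
        scratch = static_cast<char*>(STACK_ALLOC(length + 1));
    }
    else
    {
        heapScratch.resize(length + 1);
        scratch = heapScratch.data();
    }

    for (std::size_t i = 0; i <= length; ++i)
        scratch[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(text[i])));

    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i)
        if (std::strchr("aeiou", scratch[i]))
            ++count;

    // no free() for the stack copy, and scratch must not escape this function
    return count;
}
```

Calling CountVowels("Alloca And Realloc") returns 7 without touching the heap, since the 19 byte scratch copy fits under the cap.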

Alternatives

There are some common techniques I’ve seen people use in places that could have also used alloca instead. These include…

  • Small Pool Allocators – To get around the memory fragmentation problem, sometimes people will have different memory allocators based on the size of memory being allocated. This way, small temporary allocations for things like temporary string buffers will all be allocated from one place, while larger allocations for things like textures will be allocated elsewhere. This dramatically improves the memory fragmentation issue.
  • Object Pools – Object pools are similar to small pool allocators but they work by allocating some amount of memory for specific types of objects, and have a way to remember which objects are used and which ones are free. For instance, you may dynamically allocate an array of 100 SMyStruct objects and have a flag for each to know which ones are in use and which ones aren’t. This way, the program can ask for a new object, and it can find one currently not in use and return it to the caller without needing to hit the ACTUAL memory allocator to get the data (unless all objects are spoken for, at which point it can choose to fail, or allocate a new “page” of objects to be able to hand out). This also has an interesting side effect that cache misses can drop quite a bit since the same kinds of objects will be nearer to each other in memory.
  • DIY Stack Allocator – When I was working at Midway, a friend (Hi Shawn!) profiled the animation code and found that a lot of time was spent in allocating temporary buffers to blend bone data together. To fix this, he rolled his own stack allocator, where there was one contiguous piece of memory on the heap that could be allocated from. There was an internal index keeping track of where the “top of the stack” was, and when memory was allocated, that stack index would just move up by however many bytes were asked for. At the end of the frame, the stack index was reset to zero, thus “freeing” the memory. This dramatically improved the animation system performance by making the temporary bone blend buffer allocations essentially free.
  • Thread Specific Memory – If you are having problems where multiple threads are trying to allocate memory at the same time, causing contention and slowdowns due to thread synchronization, another option is to give each thread its own chunk of memory and let it allocate from that. That way there is no contention and you won’t have the slowdown of thread synchronization due to memory allocation anymore. A problem here though can be figuring out how much memory each thread needs. One thread may need a lot of memory, and another thread may need none, and you may not have any way of knowing which in advance. In this case, you’d have to allocate “a lot” of memory for each thread in advance, and pay an extra cost in memory that you technically don’t have to. But hey, at least it’s fast, maybe the trade off is worth it in your situation!
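The DIY stack allocator idea from the list above can be sketched in a few lines. This is not Shawn's actual code, just my guess at the general shape: one block from the heap, a "top of stack" index that only moves up, and a reset at the end of the frame. The class name CFrameStackAllocator and the 16 byte rounding are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class CFrameStackAllocator
{
public:
    explicit CFrameStackAllocator(std::size_t capacityBytes)
        : m_buffer(capacityBytes)
        , m_top(0)
    {
    }

    // Returns nullptr if the request does not fit in what is left.
    void* Allocate(std::size_t bytes)
    {
        // round each request up to a multiple of 16 bytes, so every
        // allocation starts at a multiple of 16 from the buffer start
        const std::size_t rounded = (bytes + 15) & ~static_cast<std::size_t>(15);
        if (rounded < bytes || rounded > m_buffer.size() - m_top)
            return nullptr;

        void* result = m_buffer.data() + m_top;
        m_top += rounded;                    // "allocating" is just moving the index
        assert(m_top <= m_buffer.size());
        return result;
    }

    // Call once per frame: everything handed out so far is "freed" at once.
    void Reset() { m_top = 0; }

    std::size_t BytesUsed() const { return m_top; }

private:
    std::vector<unsigned char> m_buffer;     // the one real heap allocation
    std::size_t m_top;                       // index of the top of the stack
};
```

The same caveat as with the real stack applies: any pointer handed out before Reset() is dangling afterwards, and nothing runs destructors for you, so this suits plain data such as temporary blend buffers.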

Lastly, there’s another common trick to avoid dynamic allocations involving templates, check it out!

// define the CStaticArray class
template <typename T, unsigned int N>
class CStaticArray
{
public:
  T m_values[N];

  // you could put functions in here to do operations on the array data to make it look more like a standard
  // data type, instead of a plain vanilla array
  unsigned int Count () { return N; }

  void SomeOtherFunction () { }
};

void MyFunc ()
{
  // make an array of 32 floats
  CStaticArray<float, 32> m_floatArray;

  // make an array of 128 SSomeStructs
  CStaticArray<SSomeStruct, 128> m_objectArray;

  for (unsigned int index = 0; index < m_objectArray.Count(); ++index)
  {
    m_objectArray.m_values[index].DoSomething();
  }
}

The above really shines if you have a standard API for strings or dynamic arrays in your code base. You can make a version like the above which works without dynamic allocations, but gives the same interface so it's easier for fellow programmers to use and swap in and out as needed.

Another nice benefit to the above technique is that it works for stack allocations, but you can also make them member variables of other objects. In this way, you can minimize dynamic allocations. Instead of having to dynamically allocate an object, and then dynamically allocate the array inside of it, you do a single allocation to get all the memory needed.

That is the closest thing in C++ that I've seen to alloca, but even so, alloca has the advantage that you can decide how much memory to allocate at run time. With the template method, you have to know the size at compile time, which is fine for a lot of cases, but other times is a deal breaker, forcing you to go back to dynamic allocations (or perhaps now, alloca instead?).

Realloc

Realloc is the other interesting memory allocation function.

Like I was mentioning above, the fewer allocations you can do, the better off you are in terms of performance, and also memory fragmentation.

By writing smart containers (dynamic arrays, dynamic strings, etc.) you can make it so that when someone shrinks a container, instead of allocating new, smaller memory, copying the data over, and freeing the old memory, it just remembers the new size but keeps the old, larger memory around.

Then later on, if the container is told to grow but still fits within that larger size from the past, it can just use some of that old memory again.

However, if that container grows larger than it used to be, you are going to have to allocate, copy, and free (costly etc) to grow the container.

Down in the guts of your computer however, there may be memory right after the current memory that’s not being used by anything else. Wouldn’t it be great if you could just say “hey… use that memory too, I don’t want to reallocate!”.

Well, realloc does ALL of the above for you without you having to write special code.

When you realloc memory, you give it the old pointer and the new size, and if it’s able to, it won’t do any new allocation whatsoever and will just hand your old pointer back to you. If the new size is larger, it may claim the adjacent memory block for you and still return the old pointer value. Or, if the new size is smaller, it may return the same memory without doing anything internally (what it does when depends on your C runtime library’s implementation of realloc).

If realloc does have to allocate new memory though, it will copy over all the old data to the new memory that it returns to you and free the old memory. So, you don’t have to CARE whether the pointer returned is old or new, just store the return value and continue on with your life. The one catch: if the allocation fails, realloc returns NULL and leaves the old memory untouched, so check the result before overwriting your only copy of the old pointer.

It’s pretty cool and can help reduce actual memory allocations, lowering memory fragmentation and increasing performance.
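As a rough sketch of how a container might lean on realloc, here is a tiny resizable byte buffer. SByteBuffer and ResizeBuffer are names made up for this example. Since realloc copies raw bytes, this only suits plain data (chars, ints, simple structs), not objects with constructors or destructors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

struct SByteBuffer
{
    char*       data;   // nullptr when empty
    std::size_t size;   // number of usable bytes
};

// Grows or shrinks the buffer, keeping as much of the old contents as fits.
// Returns false on failure, in which case the buffer is left untouched.
bool ResizeBuffer(SByteBuffer& buffer, std::size_t newSize)
{
    assert(buffer.data != nullptr || buffer.size == 0);

    if (newSize == 0)
    {
        std::free(buffer.data);
        buffer.data = nullptr;
        buffer.size = 0;
        return true;
    }

    // realloc(nullptr, n) behaves like malloc(n), so this covers the first
    // allocation too. The result may or may not equal the old pointer.
    void* resized = std::realloc(buffer.data, newSize);
    if (resized == nullptr)
        return false;   // the old block is still valid and still ours

    buffer.data = static_cast<char*>(resized);
    buffer.size = newSize;
    return true;
}
```

Storing the result in a temporary first is the important part: writing buffer.data = realloc(buffer.data, newSize) directly would leak the old block whenever realloc fails.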

Is pre-increment really faster than post increment? Part 2

In the first part of this blog post (Is pre-increment really faster than post increment? Part 1) I showed that it really doesn’t seem to matter if you use post or pre increment with simple integer types.

I then promised an example of where the choice of pre or post increment DOES matter, and here it is.

The long and the short of it is this…

  • When you pre increment a variable, you are changing the value of a variable, and the new value will be used for whatever code the pre increment is part of.
  • When you post increment a variable, you are changing the value of a variable, but the OLD value is used for whatever code the post increment is part of.

To make post increment work that way, it essentially needs to make a copy of the variable before the increment and return the copy for use by the code that the post increment is a part of.

A pre increment has no such need, it modifies the value and everything uses the same variable. There is no copy needed.
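Boiled down to the bare minimum, the difference looks something like this sketch. SCountedInt and the two helper functions are made up for illustration; the copy constructor just counts how often it runs.

```cpp
#include <cassert>

struct SCountedInt
{
    int value;
    static int s_copies;    // how many times the copy constructor has run

    explicit SCountedInt(int v) : value(v) {}
    SCountedInt(const SCountedInt& other) : value(other.value) { ++s_copies; }

    // pre increment: modify in place, hand back a reference to ourselves
    SCountedInt& operator++()
    {
        ++value;
        return *this;
    }

    // post increment: remember the old value, modify, return the old value
    SCountedInt operator++(int)
    {
        SCountedInt old(*this);   // this copy cannot be avoided
        ++value;
        return old;               // may cost a second copy, depending on the compiler
    }
};

int SCountedInt::s_copies = 0;

int CopiesForPreIncrement()
{
    SCountedInt counter(5);
    SCountedInt::s_copies = 0;
    ++counter;
    assert(counter.value == 6);
    return SCountedInt::s_copies;
}

int CopiesForPostIncrement()
{
    SCountedInt counter(5);
    SCountedInt::s_copies = 0;
    counter++;                    // result is thrown away, the copy still happens
    assert(counter.value == 6);
    return SCountedInt::s_copies;
}
```

CopiesForPreIncrement() returns 0, while CopiesForPostIncrement() returns at least 1. Whether it is 1 or 2 depends on whether the compiler elides the copy made by the return statement.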

The compiler / optimizer apparently does a good job of figuring out when it does or does not need to make a copy of integral types, but it doesn’t do as well with complex objects. Here’s some sample code to demonstrate this and the output that it generates in both debug and release.

Test Code

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//=============================================================
class CScopeMessage
{
public:
	CScopeMessage(const char *label)
	{
		PrintIndent();
		printf("%s\r\n", label);
		s_indentLevel++;
	}

	CScopeMessage(const char *label, int objectId)
	{
		PrintIndent();
		printf("%s obj %i\r\n", label, objectId);
		s_indentLevel++;
	}

	CScopeMessage(const char *label, int objectId, int copyObjectId)
	{
		PrintIndent();
		printf("%s (obj %i) to obj %i\r\n", label, objectId, copyObjectId);
		s_indentLevel++;
	}

	~CScopeMessage()
	{
		s_indentLevel--;
	}

	static void StartNewTest()
	{
		s_indentLevel = 0;
		s_lineNumber = 0;
		printf("\r\n");
	}

private:
	void PrintIndent()
	{
		s_lineNumber++;
		printf("%2i:  ", s_lineNumber);
		for(int index = 0; index < s_indentLevel; ++index)
			printf("  ");
	}

private:
	static int s_indentLevel;
	static int s_lineNumber;
};

int CScopeMessage::s_indentLevel = 0;
int CScopeMessage::s_lineNumber = 0;

//=============================================================
class CTestClass
{
public:
	CTestClass()
	{
		m_objectID = s_objectID;
		s_objectID++;

		// this is just noise in the test, but feel free to
		// comment out if you want to see for yourself
		//CScopeMessage msg("Constructing", m_objectID);
		m_value = new char[4];
		strcpy(m_value, "one");
	}

	CTestClass(const CTestClass &other)
	{
		m_objectID = s_objectID;
		s_objectID++;

		CScopeMessage msg("Copy Constructing", other.m_objectID, m_objectID);
		m_value = new char[strlen(other.m_value) + 1];
		strcpy(m_value, other.m_value);
	}

	~CTestClass()
	{
		CScopeMessage msg("Destroying", m_objectID);
		delete[] m_value;
	}

	// preincrement
	CTestClass &operator++()
	{
		CScopeMessage msg("Pre Increment", m_objectID);
		DoIncrement();
		return *this;
	}
 
	// postincrement
	CTestClass operator++(int)
	{
		CScopeMessage msg("Post Increment", m_objectID);
		CTestClass result(*this);
		DoIncrement();
		return result;
	}

	void DoIncrement()
	{
		CScopeMessage msg("Doing Increment", m_objectID);
	}

private:
	char *m_value;
	int m_objectID;

	static int s_objectID;
};

int CTestClass::s_objectID = 0;

//=============================================================
int main (int argc, char **argv)
{
	CTestClass test;
	{
		CScopeMessage msg("--Post Increment--");
		test++;
	}

	CScopeMessage::StartNewTest();
	{
		CScopeMessage msg("--Post Increment Assign--");
		CTestClass testB = test++;
	}

	CScopeMessage::StartNewTest();
	{
		CScopeMessage msg("--Pre Increment--");
		++test;
	}

	CScopeMessage::StartNewTest();
	{
		CScopeMessage msg("--Pre Increment Assign--");
		CTestClass testB = ++test;
	}

	system("pause");
	return 0;
}

Debug

Here’s the debug output:

[Image: prepostdebug (screenshot of the debug build output)]

You can see that in the post increment operator, it calls the copy constructor not once but twice! The first copy constructor is called to create the “result” object, and the second copy constructor is called to return it by value to the caller.

CTestClass operator++(int)
{
	CScopeMessage msg("Post Increment", m_objectID);
	CTestClass result(*this);
	DoIncrement();
	return result;
}

Note that it can’t return the copy by reference because it’s a local variable. C++11’s move semantics (rvalue references and “std::move”) are there to help with this stuff, but if you can’t use that tech yet, it isn’t much help hehe.

Interestingly, we can see that 2 copy constructors get called whether or not we assign the returned value.

On the pre-increment side, you can see that it only does a copy construction call if you assign the result. This is nice and is what we want. We don’t want extra object copies or memory allocations and deallocations.

Release

[Image: prepostrelease (screenshot of the release build output)]

Things are a little bit better in release, but not by much. The optimizer seems to have figured out that it doesn’t really need to do 2 object copies, since it only ever wants at most one REAL copy of the object, so it gets away with doing one object copy in both situations instead of two.

That’s an improvement, but still not as good as the pre-increment case which hasn’t visibly changed in release (not sure about the assembly of these guys, but if you look and find something interesting, post a comment!)

Summary

As always, you should check your own assembled code, or test your compiler with printf output like this post does to ensure you really know what your code is doing.

But… it seems like you might want to use pre-increment if you ever use increment operators for heavy weight objects (such as iterators), but if you want to use post increment for integral types, it ought to be fine.

That said, a lot of people say “just use pre-increment” because at worst it’ll be the same as a post increment, but at best it will be a lot more efficient.

You do whatever you want, so long as you are aware of the implications of going one way or the other 😛

A Super Tiny Random Number Generator

When I posted the last blog post about shuffling on the GameProgrammer.com mailing list, someone responded back with a super tiny random number generator that is actually pretty damn good. It is this:

x+=(x*x) | 5;

The high bit of x is the source of your random numbers, so if you want to generate an 8 bit random number, you have to call it 8 times. Apparently it passes a lot of tests for randomness really well and is a pretty high quality PRNG. Check this out for more info: http://www.woodmann.com/forum/showthread.php?3100-super-tiny-PRNG

You can start x at whatever you want, but it might take a few iterations to “warm up”, especially if you start with a small seed (i.e. you may want to throw away the first 5-10 random bits it generates as they may not be that random). I adapted it into an example program below, along with some example output. I use time() to set the initial value of x.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// A super tiny prng
// http://www.woodmann.com/forum/showthread.php?3100-super-tiny-PRNG
//
unsigned int seed = 0;
unsigned int GenerateRandomBit()
{
	seed += (seed * seed) | 5;
	return seed & 0x80000000;
}

// fills value with random bits, one bit per call to GenerateRandomBit().
// note: assumes sizeof(T) is a multiple of sizeof(unsigned int)
template <typename T>
void GenerateRandom(T& value)
{
	memset(&value, 0, sizeof(T));
	const unsigned int numBits = sizeof(T) * 8;
	unsigned int* dataPointer = (unsigned int *)&value;
	for (unsigned int index = 0; index < numBits; ++index)
	{
		if(GenerateRandomBit()) {
			unsigned int pointerIndex = index / 32;
			// 1u (unsigned) so that shifting into bit 31 is well defined
			unsigned int mask = 1u << (index % 32);
			dataPointer[pointerIndex] |= mask;
		}
	}
}

int main(int argc, char **argv)
{
	seed = (unsigned int)time(NULL);
	printf("seed = %u\r\n", seed);

	printf("9 random uints...\r\n");

	for (unsigned int index = 0; index < 9; ++index)
	{
		unsigned int random;
		GenerateRandom(random);
		printf("%2u: %10u (%x)\r\n", index, random, random);
	}

	printf("3 random floats...\r\n");
	for (unsigned int index = 0; index < 3; ++index)
	{
		float f;
		GenerateRandom(f);
		printf("%2u: %f (%x)\r\n", index, f, *((unsigned int*)&f));
	}

	printf("8 random characters...\r\n");
	char text[8];
	GenerateRandom(text);
	for (unsigned int index = 0; index < 8; ++index)
	{
		printf("%2u: %c\r\n", index, text[index]);
	}
	system("pause");
	return 0;
}

[Images: tinyprng1, tinyprng2, tinyprng3, tinyprng4 (screenshots of example output)]

Fast & Lightweight Random “Shuffle” Functionality – FIXED!

In this post I’m going to show a way to make an iterator that will visit items in a list in a random order, only visit each item once, and tell you when it’s visited all items and is finished. It does this without storing a shuffled list, and it also doesn’t have to keep track of which items it has already visited.

This means you could have a list that is a million items long and no matter how many you have visited, it will only take two uint32s to store the current state of the iterator, and it will be very fast to get the next item in the list (somewhere around 1 to 4 times the cost of calling the rand() function!).

This is a follow up post to an older post called Fast & Lightweight Random “Shuffle” Functionality.

In that older post, things didn’t work quite like I expected them to so it was back to the drawing board for a little while to do some research and experimentation to find a better solution. I’m not going to go back over the basics I talked about in that article so go back and have a read there first if anything is confusing.

High Level

In the last post on this topic we talked about how the high level goal was to map the index to a random number, and because we were randomizing the bits in a deterministic (and non destructive) way, we needed to iterate over the whole “next power of 2” items and reject any that were too large. Only by doing this could we be sure that we visited every index. The problem I hit last time though was that I could not get the numbers to look random enough.

To solve this, I decided what I needed to do was ENCRYPT the index with a block cipher. When you encrypt data, it should come out looking like random data, even though the data you put in may be sequential or have other easily seen patterns. What else is great is that when using a block cipher, each unique input maps to a unique output, which means that if we encrypt the full power of 2 range as input, we will get the full power of 2 range out as output, just in a different order.

Once I realized this was a good solution, my next problem to tackle was that I knew of no block cipher algorithms that would work for a variable number of bits. There are block cipher algorithms that will work for a LARGE number of bits, but there is no algorithm I knew of where you can tell it “hey, I only want to use 4 bits” or “I only want to use 5 bits”.

In the end the answer I found was to roll my own, but use existing, well established technology to do so. In my specific case, I’m also aiming for high speed functions since I want this functionality used in real time gaming applications.

What I came up with in the end is not cryptographically secure, but using the same techniques I have laid out, you should be able to drop in a different block cipher algorithm if that is a requirement.

Feistel Network

As it turns out, there is a basic primitive of cryptography called a Feistel Network. It’s used by quite a number of modern ciphers and hash functions, and it is surprisingly simple.

For a balanced Feistel Network, you split the data into a left and a right side, and do the following, which consists of a single round (you can do as many rounds as you like):

Left[i+1]  = Right[i];
Right[i+1] = Left[i] ^ RoundFunction(Right[i], key);

After performing however many rounds you wish, you combine the left and the right sides again to get your encrypted data.

To decrypt, the Feistel network works much the same but in reverse, looking like the below:

Right[i] = Left[i+1];
Left[i] = Right[i+1] ^ RoundFunction(Left[i+1], key);

Check out the wikipedia page if you are interested in more info.

The neat thing about Feistel Networks is that the round function can be any deterministic function that performs whatever operations it wants – even destructive and irreversible operations such as division or bit shifts. Even so, the Feistel network as a whole is reversible, no matter what you do in the round function, and you can decrypt to get your original data back.
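To make that concrete, here is a small sketch of a balanced Feistel network over a configurable number of bits, following the two sets of equations above. The round function is a stand-in I made up (a multiply, an XOR and a lossy right shift), chosen precisely because it throws information away. The function names and the 16 bit limit per half are assumptions of this example.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in round function: deliberately NOT reversible (the shift drops bits).
std::uint32_t RoundFunction(std::uint32_t right, std::uint32_t key)
{
    return ((right * 2654435761u) ^ key) >> 3;
}

// Encrypts the low (2 * halfBits) bits of value. halfBits must be 1 to 16.
std::uint32_t FeistelEncrypt(std::uint32_t value, unsigned halfBits,
                             std::uint32_t key, unsigned rounds)
{
    assert(halfBits >= 1 && halfBits <= 16);
    const std::uint32_t mask = (1u << halfBits) - 1u;
    std::uint32_t left  = (value >> halfBits) & mask;
    std::uint32_t right = value & mask;

    for (unsigned round = 0; round < rounds; ++round)
    {
        // Left[i+1] = Right[i];  Right[i+1] = Left[i] ^ F(Right[i], key)
        const std::uint32_t newRight = left ^ (RoundFunction(right, key) & mask);
        left  = right;
        right = newRight;
    }
    return (left << halfBits) | right;
}

// Undoes FeistelEncrypt when given the same halfBits, key and rounds.
std::uint32_t FeistelDecrypt(std::uint32_t value, unsigned halfBits,
                             std::uint32_t key, unsigned rounds)
{
    assert(halfBits >= 1 && halfBits <= 16);
    const std::uint32_t mask = (1u << halfBits) - 1u;
    std::uint32_t left  = (value >> halfBits) & mask;
    std::uint32_t right = value & mask;

    for (unsigned round = 0; round < rounds; ++round)
    {
        // Right[i] = Left[i+1];  Left[i] = Right[i+1] ^ F(Left[i+1], key)
        const std::uint32_t newLeft = right ^ (RoundFunction(left, key) & mask);
        right = left;
        left  = newLeft;
    }
    return (left << halfBits) | right;
}
```

With halfBits set to 3, every value from 0 to 63 encrypts to a distinct value from 0 to 63, and decrypting gives the original back. The round function's output is only ever XORed in, and XORing the same thing twice cancels out, which is why the round function itself never needs an inverse.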

This threw me for quite a loop and I couldn’t get my head around why this worked for a long while until I found a webpage that explained it pretty well. Unfortunately I lost the link and cannot find it again, but the basic idea is this… For each round of encryption, the right side is encrypted using the key and the left side. This means that at any point, no matter how many rounds you’ve done on your data, the right side should be able to be decrypted using the key and the left side. If you have the key, and you know how many rounds were used in encryption, you have all the data you need to decrypt it again. Hopefully that makes sense… I had to work it out on paper a little bit to see it fully.

The other great thing about Feistel Networks is that you can make them be however many bits you want. So, if I want each side of the Feistel Network to be 1 bit, I can do that. Or, if I want each side to be 128 bits, I can do that too!

You can also tune the quality / performance a bit by doing fewer or more rounds.

BTW the Tiny Encryption Algorithm uses a Feistel Network if you want to see a simple example in action.

With the “variable bit size support” problem solved, next I needed to come up with a round function that did a good job of taking sequential numbers as input and spitting out seemingly random numbers as output. Thanks to what I was saying before, the round function doesn’t need to be reversible, so there are a lot of options available.

I ended up deciding to go with a hash function, specifically Murmur Hash 2 (which I actually also used in my last post if you’d like to see some more information on it! The Incredible Time Traveling Random Number Generator).

Since the hash spits out numbers that might be anything in the range of an unsigned int, but I only want N bits, I just AND the hash against a mask to get the number of bits I want. There’s probably a higher quality method of smashing down the bits using XOR or something, but my quality needs aren’t very high so I just opted to AND it.

A downside of going with the balanced Feistel Network approach is that before this, I only had to round up to the next power of 2, but now, since each half of the data needs the same number of bits, I have to make sure I have an even number of bits in total, which means rounding up to the next power of 4. This means that when it’s looking for valid indices to return in the shuffle, in the worst case it may have to calculate almost 4 different indices on average before it finds a valid one. Not the greatest thing in the world, but also not the worst and definitely not a deal breaker in my eyes.
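As a rough sketch of that cost (the helper names here are hypothetical, and numItems is assumed to be at least 1 and no larger than 2^30 so the multiply cannot overflow), you can compute the padded size and the average number of indices generated per valid item:

```cpp
// Smallest power of 4 that is >= numItems (and at least 4).
unsigned int NextPow4(unsigned int numItems)
{
	unsigned int pow4 = 4;
	while (numItems > pow4)
		pow4 *= 4;
	return pow4;
}

// Over a whole shuffle, every index in [0, NextPow4) gets generated once, but
// only numItems of them are valid, so this is the average work per item.
double AverageTriesPerItem(unsigned int numItems)
{
	return (double)NextPow4(numItems) / (double)numItems;
}
```

For 10 items the padded size is 16 (1.6 tries per item). For 17 items it jumps to 64, or about 3.76 tries per item, which is close to the worst case.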

The Code

At long last, here is the code! Use it in good health (:

There are some example runs of the program below it as well.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// MurmurHash code was taken from https://sites.google.com/site/murmurhash/
//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby

// Note - This code makes a few assumptions about how your machine behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
	// 'm' and 'r' are mixing constants generated offline.
	// They're not really 'magic', they just happen to work well.

	const unsigned int m = 0x5bd1e995;
	const int r = 24;

	// Initialize the hash to a 'random' value

	unsigned int h = seed ^ len;

	// Mix 4 bytes at a time into the hash

	const unsigned char * data = (const unsigned char *)key;

	while(len >= 4)
	{
		unsigned int k = *(unsigned int *)data;

		k *= m; 
		k ^= k >> r; 
		k *= m; 
		
		h *= m; 
		h ^= k;

		data += 4;
		len -= 4;
	}
	
	// Handle the last few bytes of the input array

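	switch(len)
	{
	case 3: h ^= data[2] << 16;
	case 2: h ^= data[1] << 8;
	case 1: h ^= data[0];
	        h *= m;
	};

	// Do a few final mixes of the hash to ensure the last few
	// bytes are well-incorporated.

	h ^= h >> 13;
	h *= m;
	h ^= h >> 15;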

	return h;
}

struct SShuffler
{
public:
	SShuffler(unsigned int numItems, unsigned int seed)
	{
		// initialize our state
		m_numItems = numItems;
		m_index = 0;
		m_seed = seed;

		// calculate next power of 4.  Needed since the balanced feistel network needs
		// an even number of bits to work with
		m_nextPow4 = 4;
		while (m_numItems > m_nextPow4)
			m_nextPow4 *= 4;

		// find out how many bits we need to store this power of 4
		unsigned int numBits = 0;
		unsigned int mask = m_nextPow4 - 1;
		while(mask)
		{
			mask = mask >> 1;
			numBits++;
		}

		// calculate our left and right masks to split our indices for the feistel 
		// network
		m_halfNumBits = numBits / 2;
		m_rightMask = (1 << m_halfNumBits) - 1;
		m_leftMask = m_rightMask << m_halfNumBits;
	}

	void Restart()
	{
		Restart(m_seed);
	}

	void Restart(unsigned int seed)
	{
		// store the seed we were given
		m_seed = seed;

		// reset our index
		m_index = 0;
	}

	// Get the next index in the shuffle.  Returning false means the shuffle
	// is finished and you should call Restart() if you want to start a new one.
	bool Shuffle(unsigned int &shuffleIndex)
	{
		// m_index is the index to start searching for the next number at
		while (m_index < m_nextPow4)
		{
			// get the next number
			shuffleIndex = NextNumber();

			// if we found a valid index, return success!
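			if (shuffleIndex < m_numItems)
				return true;
		}

		// end of the shuffled list if we got here
		return false;
	}

	// Get the previous index in the shuffle.  Returning false means we hit
	// the beginning of the shuffle.
	bool ShuffleBackwards(unsigned int &shuffleIndex)
	{
		while (m_index > 1)
		{
			// get the last number
			shuffleIndex = LastNumber();

			// if we found a valid index, return success!
			if (shuffleIndex < m_numItems)
				return true;
		}

		// beginning of the shuffled list if we got here
		return false;
	}

private:
	// get the next number in the sequence and advance m_index
	unsigned int NextNumber()
	{
		unsigned int ret = EncryptIndex(m_index);
		m_index++;
		return ret;
	}

	// get the previous number in the sequence and step m_index back
	unsigned int LastNumber()
	{
		unsigned int lastIndex = m_index - 2;
		unsigned int ret = EncryptIndex(lastIndex);
		m_index--;
		return ret;
	}

	// run the index through the feistel network to get the shuffled index
	unsigned int EncryptIndex(unsigned int index)
	{
		// break our index into the left and right half
		unsigned int left = (index & m_leftMask) >> m_halfNumBits;
		unsigned int right = (index & m_rightMask);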

		// do 4 feistel rounds 
		for (int index = 0; index < 4; ++index)
		{
			unsigned int newLeft = right;
			unsigned int newRight = left ^ (MurmurHash2(&right, sizeof(right), m_seed) & m_rightMask);
			left = newLeft;
			right = newRight;
		}

		// put the left and right back together to form the encrypted index
		return (left << m_halfNumBits) | right;
	}

private:

	// precalculated values
	unsigned int m_nextPow4;
	unsigned int m_halfNumBits;
	unsigned int m_leftMask;
	unsigned int m_rightMask;

	// member vars
	unsigned int m_index;
	unsigned int m_seed;
	unsigned int m_numItems;

	// m_index assumptions:
	//   1) m_index is where to start looking for next valid number
	//   2) m_index - 2 is where to start looking for last valid number
};

// our songs that we are going to shuffle through
const unsigned int g_numSongs = 10;
const char *g_SongList[g_numSongs] =
{
	" 1. Head Like a Hole",
	" 2. Terrible Lie",
	" 3. Down in It",
	" 4. Sanctified",
	" 5. Something I Can Never Have",
	" 6. Kinda I Want to",
	" 7. Sin",
	" 8. That's What I Get",
	" 9. The Only Time",
	"10. Ringfinger"
};

int main(void)
{
	// create and seed our shuffler.  If two similar numbers are hashed they should give
	// very different results usually, so for a seed, we can hash the time in seconds,
	// even though that number should be really similar from run to run
    unsigned int currentTime = time(NULL);
    unsigned int seed = MurmurHash2(&currentTime, sizeof(currentTime), 0x1337beef);
	SShuffler shuffler(g_numSongs, seed);

	// shuffle play the songs
	printf("Listen to Pretty Hate Machine (seed = %u)\r\n", seed);
	unsigned int shuffleIndex = 0;
	while(shuffler.Shuffle(shuffleIndex))
		printf("%s\r\n",g_SongList[shuffleIndex]);

	system("pause");
	return 0;
}

[Images: four example runs of the program]

The Incredible Time Traveling Random Number Generator

It isn’t very often that you need a pseudo random number generator (PRNG) that can go forwards or backwards in time, or skip to specific points in the future or the past. However, if you are ever writing a game like Braid and do end up needing one, here’s one way to do it.

At the core of how this is going to work, we are going to keep track of an index, and have some way to convert that index into a random number. If we want to move forward in time, we will just increment the index. If we want to move backwards in time, we will just decrement the index. It’s super simple from a high level, but how are we going to convert an index into a random number?

There are lots of pseudo random number generators out there that we could leverage for this purpose, the most famous being C++’s built in “rand()” function, and another one famous in the game dev world is the Mersenne Twister.

I’m going to do something a little differently though as it leads well into the next post I want to write, and may be a little bit different than some people are used to seeing; I want to use a hash function.

Murmur Hash 2

Good hash functions have the property that small changes in input give large changes in output. This means that if we hash the number 1 and then hash the number 2, the outputs ought not to be similar; in the usual case they ought to be wildly different numbers. Sometimes, just like real random numbers, we might get the same number twice in a row, but that is desired behavior, since it makes the output act like a real random sequence.
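A quick way to see that property for yourself is to hash two neighboring integers and count how many output bits differ. The sketch below does not use Murmur Hash 2; to stay short it swaps in the 32 bit finalizer from MurmurHash3 (fmix32) as a stand-in mixer, and the function names are hypothetical.

```cpp
#include <cstdint>

// Stand-in integer mixer: the fmix32 finalizer from MurmurHash3.
uint32_t Mix32(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

// Number of bit positions where a and b differ.
int BitsChanged(uint32_t a, uint32_t b)
{
	uint32_t diff = a ^ b;
	int count = 0;
	while (diff)
	{
		count += (int)(diff & 1u);
		diff >>= 1;
	}
	return count;
}
```

Calling BitsChanged(Mix32(1), Mix32(2)) shows how many of the 32 output bits flipped for a tiny change in input. For a well mixed hash you would expect around half of them to flip on average.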

There are varying levels of quality of hash functions, ranging from a simple string “hash” function of using the first character of a string (super low quality hash function, but super fast) all the way up to cryptographic quality hash functions like MD5 and SHA-1 which are a lot higher quality but also take a lot longer to generate.

In our usage case, I’m going to assume this random number generator is going to be used for game use, where if the player can discover the pattern in the random numbers, they won’t be able to gain anything meaningful or useful from that, other than at most be able to cheat at their own single player game. However, I really do want the numbers to be fairly random to the casual eye. I don’t want visible patterns to be noticeable since that would decrease the quality of the gameplay. I would also like my hash to run as quickly as possible to keep game performance up.

Because of that level of quality I’m aiming for, I opted to go with a fast, non cryptographic hash function called Murmur Hash 2. It runs pretty quick and it gives pretty decent quality results too – in fact the official Murmur Hash Website claims that it passes the Chi Squared Test for “practically all keysets & bucket sizes”.

If you need a higher quality set of random numbers, you can easily drop in a higher quality hash in place of Murmur Hash. Or, if you need to go the other way and have faster code at the expense of random number quality, you can do that too.

Speed Comparison

How fast is it? Here’s some sample code to compare it vs C++’s built in rand() function, as well as an implementation of the Mersenne Twister I found online that seems to perform pretty well.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h> // LARGE_INTEGER, QueryPerformanceCounter
#include "tinymt32.h" // from http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/TINYMT/index.html

// how many numbers to generate
#define NUMBERCOUNT 10000000  // Generate 10 million random numbers

// profiling macros
#define PROFILE_BEGIN \
{ \
	LARGE_INTEGER freq; \
	LARGE_INTEGER start; \
	QueryPerformanceFrequency(&freq); \
	QueryPerformanceCounter(&start);

#define PROFILE_END(label) \
	LARGE_INTEGER end; \
	QueryPerformanceCounter(&end); \
	printf(label " - %f ms\r\n", ((double)(end.QuadPart - start.QuadPart)) * 1000.0 / freq.QuadPart); \
}

// MurmurHash code was taken from https://sites.google.com/site/murmurhash/
//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby

// Note - This code makes a few assumptions about how your machine behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
	// 'm' and 'r' are mixing constants generated offline.
	// They're not really 'magic', they just happen to work well.

	const unsigned int m = 0x5bd1e995;
	const int r = 24;

	// Initialize the hash to a 'random' value

	unsigned int h = seed ^ len;

	// Mix 4 bytes at a time into the hash

	const unsigned char * data = (const unsigned char *)key;

	while(len >= 4)
	{
		unsigned int k = *(unsigned int *)data;

		k *= m; 
		k ^= k >> r; 
		k *= m; 
		
		h *= m; 
		h ^= k;

		data += 4;
		len -= 4;
	}
	
	// Handle the last few bytes of the input array

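	switch(len)
	{
	case 3: h ^= data[2] << 16;
	case 2: h ^= data[1] << 8;
	case 1: h ^= data[0];
	        h *= m;
	};

	// Do a few final mixes of the hash to ensure the last few
	// bytes are well-incorporated.

	h ^= h >> 13;
	h *= m;
	h ^= h >> 15;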

	return h;
}

void RandTest()
{
	for(int index = 0; index < NUMBERCOUNT; ++index)
		int i = rand();
}

unsigned int MurmurTest()
{
	unsigned int key = 0;
	for(int index = 0; index < NUMBERCOUNT; ++index)
		key = MurmurHash2(&key,sizeof(key),0);
	return key;
}

// g_twister is global and inited in main so it doesnt count towards timing
tinymt32_t g_twister; 
unsigned int TwisterTest()
{
	unsigned int ret = 0;
	for(int index = 0; index < NUMBERCOUNT; ++index)
		ret = tinymt32_generate_uint32(&g_twister);
	return ret;
}

int main(int argc, char**argv)
{
	// rand() test
	PROFILE_BEGIN;
	RandTest();
	PROFILE_END("rand()");

	// hash test
	unsigned int murmurhash;
	PROFILE_BEGIN;
	murmurhash = MurmurTest();
	PROFILE_END("Murmur Hash 2");

	// twister test
	g_twister.mat1 = 0;
	g_twister.mat2 = 0;
	tinymt32_init(&g_twister, 0);
	unsigned int twister;
	PROFILE_BEGIN;
	twister = TwisterTest();
	PROFILE_END("Mersenne Twister");

	// show the results
	system("pause");

	// this is here so that the murmur and twister code doesn't get optimized away
	printf("%u %u\r\n", murmurhash, twister);

	return 0;
}

Here's the output of that code run in release on my machine, generating 10 million random numbers of each type. You can see that murmurhash takes about 1/3 as long as rand() but is not quite as fast as the Mersenne Twister. I ran this several times and got similar results, so all in all, Murmur Hash 2 is pretty fast!

[Image: timing output for rand(), Murmur Hash 2 and the Mersenne Twister]

Final Code & Sample Output

Performance looks good but how about the time traveling part, and how about seeing some example output?

Here’s the finalized code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// MurmurHash code was taken from https://sites.google.com/site/murmurhash/
//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby

// Note - This code makes a few assumptions about how your machine behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
	// 'm' and 'r' are mixing constants generated offline.
	// They're not really 'magic', they just happen to work well.

	const unsigned int m = 0x5bd1e995;
	const int r = 24;

	// Initialize the hash to a 'random' value

	unsigned int h = seed ^ len;

	// Mix 4 bytes at a time into the hash

	const unsigned char * data = (const unsigned char *)key;

	while(len >= 4)
	{
		unsigned int k = *(unsigned int *)data;

		k *= m; 
		k ^= k >> r; 
		k *= m; 
		
		h *= m; 
		h ^= k;

		data += 4;
		len -= 4;
	}
	
	// Handle the last few bytes of the input array

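	switch(len)
	{
	case 3: h ^= data[2] << 16;
	case 2: h ^= data[1] << 8;
	case 1: h ^= data[0];
	        h *= m;
	};

	// Do a few final mixes of the hash to ensure the last few
	// bytes are well-incorporated.

	h ^= h >> 13;
	h *= m;
	h ^= h >> 15;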

	return h;
}

class CReversablePRNG
{
public:
	CReversablePRNG()
	{
		m_index = 0;
		m_seed = 0;
	}

	unsigned int NextNumber()
	{
		unsigned int ret = MurmurHash2(&m_index, sizeof(m_index), m_seed);
		m_index++;
		return ret;
	}

	unsigned int LastNumber()
	{
		unsigned int lastIndex = m_index - 2;
		unsigned int ret = MurmurHash2(&lastIndex, sizeof(lastIndex), m_seed);
		m_index--;
		return ret;
	}

	// to be able to save / restore state for a save game or whatever else
	void GetState(unsigned int &index, unsigned int &seed)
	{
		index = m_index;
		seed = m_seed;
	}

	void SetState(unsigned int index, unsigned int seed)
	{
		m_index = index;
		m_seed = seed;
	}

private:
	unsigned int m_index;
	unsigned int m_seed;
};

int main(int argc, char**argv)
{
	// create and seed our random number generator.  If two similar numbers are hashed
	// they should give very different results usually, so for a seed, we can hash the
	// time in seconds, even though the number from run to run should be really similar
	CReversablePRNG prng;
	unsigned int currentTime = time(NULL);
	unsigned int seed = MurmurHash2(&currentTime, sizeof(currentTime), 0x1337beef);
	prng.SetState(0, seed);

	// display our seed and our table header
	printf("seed = %u\r\n", seed);
	printf("index | raw number | mod 10\r\n");
	printf("---------------------------\r\n");

	// generate 10 numbers forward
	for (int index = 0; index < 10; ++index)
	{
		unsigned int nextNumber = prng.NextNumber();
		printf("%2i    | %10u | %u\r\n", index, nextNumber, nextNumber % 10);
	}

	// generate 3 numbers back
	printf("\r\n");
	for (int index = 0; index < 3; ++index)
	{
		unsigned int lastNumber = prng.LastNumber();
		printf("%2i    | %10u | %u\r\n", 8 - index, lastNumber, lastNumber % 10);
	}

	// generate 5 numbers forward
	printf("\r\n");
	for (int index = 0; index < 5; ++index)
	{
		unsigned int nextNumber = prng.NextNumber();
		printf("%2i    | %10u | %u\r\n", 7 + index, nextNumber, nextNumber % 10);
	}
	}

	system("pause");

	return 0;
}

[Images: four sample outputs of the program]

Next Up

Hopefully you enjoyed this post!

Next up I’m going to be applying this code to the problem of shuffling to continue on from the post where I tried to do that before: Fast & Lightweight Random “Shuffle” Functionality.

Why do you hate me rand()?!

TL;DR – I’ve always heard rand() sucked for generating (cryptographically strong) random numbers, but it turns out it’s just kind of bad in general too LOL.

OK so this is bizarre, I made a default settings console project in MSVC 2012 with the code below:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char** argv)
{
	time_t thetime = 0;
	time(&thetime);
	srand(thetime);
	int a = rand();
	int b = rand();
	int c = rand();
	int d = rand();

	printf("time = %llu (%llu)\r\na = %i\r\nb = %i\r\nc = %i\r\nd = %i\r\n", (unsigned long long)thetime, (unsigned long long)(thetime % RAND_MAX), a, b, c, d);
	return 0;
}

Here are some sample outputs, can you see what’s wrong?!

time = 1371620230 (26377)
a = 11108
b = 28489
c = 18911
d = 15679
time = 1371620268 (26415)
a = 11232
b = 10944
c = 9621
d = 12581
time = 1371620289 (26436)
a = 11301
b = 7285
c = 24321
d = 26390
time = 1371620310 (26457)
a = 11369
b = 3625
c = 6252
d = 7432
time = 1371620332 (26479)
a = 11441
b = 10714
c = 6048
d = 12537

5 times in a row you can see that the first number randomly generated is in the 11,000’s. You can also see that it’s steadily increasing.

I included the time modulo RAND_MAX in case that was the first number returned but it isn’t. I also looked at the numbers in hex and there isn’t a clear pattern there either. I can’t really discern the correlation between the time and the first random number, but there is definitely a pattern of some kind.
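One possible explanation, assuming the runtime's rand() is the classic linear congruential generator that Microsoft's C runtime has historically used (that is an assumption here, not something checked against the MSVC 2012 sources): the first output is ((seed * 214013 + 2531011) >> 16) & 0x7fff, which is a linear function of the seed. Adding 1 to the seed moves the first output by roughly 214013 / 65536, or about 3.27, which fits the samples above, where 38 seconds between the first two runs moved a by 124. The struct name below is hypothetical.

```cpp
#include <cstdint>

// Hypothetical stand-in for the runtime's rand() / srand() pair, using the
// classic Microsoft LCG constants (assumed, not taken from the MSVC 2012 CRT).
struct MsvcStyleRand
{
	uint32_t state;

	explicit MsvcStyleRand(uint32_t seed) : state(seed) {}

	// Advance the generator and return a value in [0, 0x7fff].
	int Next()
	{
		state = state * 214013u + 2531011u;
		return (int)((state >> 16) & 0x7fffu);
	}
};
```

Seeding this with the first time value above (1371620230) gives 11108 and then 28489, which matches a and b from the first sample run.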

You always hear you shouldn’t use rand() if you need really high quality random numbers (like used for encryption), but i always figured if you use srand() with time, your number will be good enough for games at least. Turns out, you might want to throw out the first random number rand gives you before using the stuff for your games too. Maybe throw out a couple just in case! 😛

You might wonder why b, c and d are seemingly more random than a, but that’s likely due to the Avalanche Effect aka “sensitivity to initial conditions” which as it turns out is a nice property of cryptographic algorithms as well as pseudo random number generators. That is also a fundamental idea from Chaos Theory.

Essentially, as you ask for more random numbers, they ought to be more unpredictable, and more “random”. You just get some trash in the beginning.

Anyways… I’m super surprised by just how bad rand() is… I guess I never looked at it like this before (or maybe this is some new bad behavior in MSVC 2012?). Also, RAND_MAX is defined for me as 0x7fff. Ouchies, where are the rest of our numbers? 😛

Fast & Lightweight Random “Shuffle” Functionality


NOTE: THIS ARTICLE ENDS IN FAILURE. IF YOU WANT TO SKIP AHEAD TO THE SUCCESSFUL METHOD CHECK OUT THIS POST: Fast & Lightweight Random “Shuffle” Functionality – FIXED

Sometimes in game development we have a list of things that we want to choose from randomly, but we want to make sure and only choose each thing one time.

For instance, let’s say you are making a game where the player gets quests randomly from an NPC, but when they finish the current group of quests, they can then move onto the next group of quests (because the next group is harder than the first group).

Here’s some obvious ways you might implement this:

  • Make a list of the items you want to choose randomly from. When you choose an item from the list, remove it from the list so that it won’t be chosen from next time. You have to allocate and maintain (and maybe serialize) a list which is a little bit of a bummer, but it gets the job done.
  • Add a flag or bool to each item to remember whether it’s been chosen from. When you randomly choose an item, if it has already been chosen, roll a random number again. If it hasn’t been chosen, mark it as having been chosen and use it. The bummer here is that it isn’t constant time. If you have a lot of items to choose from, when you get near the end and only have a few items left unused, you’ll have to roll a bunch of random numbers before you find a valid one to use.
  • Taking a cue from the last article Efficiently Generate Random Numbers Without Repeats, you could go with the 2nd option, but count how many unused items there are, roll a random number for the number of unused items, and then count through the list again to find which unused item you rolled. This is nicer, but it might be costly to have to traverse the list, especially if the list has a lot of items in it.
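As a minimal sketch of that third option (the function name is hypothetical, and rand() % n stands in for whatever dice roll function you actually use), the "count, roll, then walk the list" idea looks something like this:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Picks a random unused item, marks it used, and returns its index.
// Returns -1 once every item has been used.
int PickUnused(std::vector<bool> &used)
{
	// first pass: count how many items are still unused
	unsigned int unusedCount = 0;
	for (bool isUsed : used)
		if (!isUsed)
			++unusedCount;
	if (unusedCount == 0)
		return -1;

	// roll once, then walk the list to find the roll-th unused item
	unsigned int roll = (unsigned int)std::rand() % unusedCount;
	for (std::size_t i = 0; i < used.size(); ++i)
	{
		if (used[i])
			continue;
		if (roll == 0)
		{
			used[i] = true;
			return (int)i;
		}
		--roll;
	}
	return -1; // not reached
}
```

It only rolls one random number per pick, but both loops walk the whole list, which is the traversal cost mentioned above.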

Computer Science Magic

Computer science has some neat things to it that could help you out here. Most notably, there are various algorithms for traversing numbers or multidimensional spaces in ways other than sequentially, for various reasons. Here are 2 such things for example (but there are many more out there waiting for you to find them!):

You could likely leverage something like these guys to traverse the items in a list in a way that looked random to a player, but would actually be well defined and have a pattern to them. The unfortunate thing about these is that they may be the same “random” every time though. With some creativity and programming alchemy you might be able to get around that problem though.

If something like this works well enough for you, it might be your solution!

XOR Magic

XOR is a magical thing. I started game development back in 16 bit CPU days before hardware accelerated graphics, and when GL (and later directX) first appeared, my first question was “this is all very neat, but how do i XOR pixels?” Sadly, the answer was that I couldn’t, and as far as I know, we still don’t have that ability with shaders and it makes me sad LOL.

Anyways, besides making really snazzy “programmer art” style graphics (and selection boxes), XOR is one of the cornerstones of encryption and other cryptography. (Like Cryptography? I have a bunch of posts on it, go check em out! Cryptography 101)

For instance, the “one time pad” is the ONLY mathematically proven uncrackable encryption scheme, and it uses XOR. In fact, it ONLY uses XOR.

One property of XOR that makes it useful for cryptography also makes it useful to us here in our shuffle code. That property is this: If you have a random number, and a non random number, when you XOR them together, the result will be a random number too. (BTW the other property that makes it useful for cryptography is that it’s reversible, but we don’t care about that right now).

Think about that for a minute… that means that if you have a random number, and you count from 1 to 10, XORing each number by the same random number, the resulting numbers ought to be random too. What else is great is that because XOR is reversible (XORing the result by the same random number gives you back the number you started with), two different inputs can never give the same output, so the numbers you get out will all be unique and not repeat. TA-DA! There are some problems left to solve, but we now have the basis for our shuffle algorithm!

Are you down with the SHUF?

Ok so the basic idea for shuffling is this: We are going to loop through the list normally, but we are going to xor each index against a pre-rolled random number so that it randomizes the index for us in a way that will have no repeats. Let’s pretend that we have 5 things to loop through and that our random number is 3. Let’s try it out:

index 0: 0 ^ 3 == 3
index 1: 1 ^ 3 == 2
index 2: 2 ^ 3 == 1
index 3: 3 ^ 3 == 0
index 4: 4 ^ 3 == 7

Our last number ended up being 7, what the heck happened? Well, the issue here is that it’s randomizing the bits in our indices, not really shuffling our 5 items. With 5 items to loop through, that means there are 3 bits that it is randomizing, which means that we might encounter any 3 bit value at any time (including 7, the highest one!), and that we would need to iterate through all 3 bit values to encounter all the indices that we are looking for (0 through 4). We’ll just have to loop through all 3 bit indices and ignore anything too large. Here’s all of the values:

index 0: 0 ^ 3 == 3
index 1: 1 ^ 3 == 2
index 2: 2 ^ 3 == 1
index 3: 3 ^ 3 == 0
index 4: 4 ^ 3 == 7 (ignore)
index 5: 5 ^ 3 == 6 (ignore)
index 6: 6 ^ 3 == 5 (ignore)
index 7: 7 ^ 3 == 4

Looks like we solved that issue.

The other issue that comes up is that the random number can be any number that can fit in an unsigned int. When we xor a huge number by our small indices, we’ll get giant numbers out as a result.

For instance if our random number was 15367, xoring that against index 3 would give us 15364.

To fix that, we can just use the lowest 3 bits of the random number (& against 7). That way, the random number can only have bits set in the lowest 3 bits, and our index already can only have bits set in the lowest 3 bits, so the end result can also only have bits set in the lowest 3 bits.
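Putting those pieces together, here is a small standalone sketch of the idea so far. The function name is hypothetical, and numItems is assumed to be small enough (at most 2^31) that the power of 2 loop cannot overflow.

```cpp
#include <vector>

// Returns the order the items would be visited in for a given random number.
std::vector<unsigned int> XorShuffleOrder(unsigned int numItems, unsigned int randomNumber)
{
	// build a mask covering the next power of 2 at or above numItems
	unsigned int mask = 1;
	while (mask < numItems)
		mask <<= 1;
	mask -= 1;

	// only keep the bits of the random number that fit in the mask
	unsigned int key = randomNumber & mask;

	std::vector<unsigned int> order;
	for (unsigned int index = 0; index <= mask; ++index)
	{
		unsigned int shuffled = index ^ key;
		if (shuffled < numItems) // ignore anything too large
			order.push_back(shuffled);
	}
	return order;
}
```

XorShuffleOrder(5, 3) gives 3, 2, 1, 0, 4, the same order as the table above. XorShuffleOrder(5, 15367) only uses the low 3 bits of 15367 (which are 7), so it gives 4, 3, 2, 1, 0.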

I think we are ready to write some code!

The Source Code

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

template <unsigned int c_numItems>
struct SShuffler
{
public:
	SShuffler()
	{
		Start();
	}

	// start a new shuffle
	void Start()
	{
		m_index = (unsigned int)-1;
		m_randomNumber = ((unsigned int)rand()) & c_numItemsNextPow2Mask;
	}

	// whether or not the shuffle is finished
	bool IsDone()
	{
		return m_index == c_numItemsNextPow2;
	}

	// Get the next index in the shuffle
	bool Shuffle(unsigned int &shuffleIndex)
	{
		// increment our index until we reach our max index,
		// or we find a valid index
		do
		{
			m_index++;
			shuffleIndex = m_index ^ m_randomNumber;
		}
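		while (m_index < c_numItemsNextPow2 && shuffleIndex >= c_numItems);

		// if we haven't reached the max index, our shuffle was successful
		return m_index < c_numItemsNextPow2;
	}

	// Get the seed (the masked random number), for debugging purposes
	unsigned int DebugGetSeed()
	{
		return m_randomNumber;
	}

private:
	// calculate the next power of 2 (and its mask) at compile time
	static const unsigned int c_numItemsMinusOne = c_numItems - 1;
	static const unsigned int c_A = c_numItemsMinusOne | c_numItemsMinusOne >> 1;
	static const unsigned int c_B = c_A | c_A >> 2;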
	static const unsigned int c_C = c_B | c_B >> 4;
	static const unsigned int c_D = c_C | c_C >> 8;
	static const unsigned int c_numItemsNextPow2Mask = c_D | c_D >> 16;
	static const unsigned int c_numItemsNextPow2 = c_numItemsNextPow2Mask + 1;

	// member vars
	unsigned int m_index;
	unsigned int m_randomNumber;
};

// our songs that we are going to shuffle through
const unsigned int g_numSongs = 10;
const char *g_SongList[g_numSongs] =
{
	"1. Head Like a Hole",
	"2. Terrible Lie",
	"3. Down in It",
	"4. Sanctified",
	"5. Something I Can Never Have",
	"6. Kinda I Want to",
	"7. Sin",
	"8. That's What I Get",
	"9. The Only Time",
	"10. Ringfinger"
};

int main(void)
{
	// use the current time as a seed for our random number generator
	srand((unsigned)time(0));

	// declare a shuffler object
	SShuffler<g_numSongs> shuffler;

	// shuffle play once
	printf("I wanna listen to some NIN...(seed = %u)\r\n\r\n", shuffler.DebugGetSeed());
	unsigned int shuffleIndex = 0;
	while(!shuffler.IsDone())
	{
		if (shuffler.Shuffle(shuffleIndex))
			printf("%s\r\n",g_SongList[shuffleIndex]);
	}

	// shuffle play again
	shuffler.Start();
	printf("\r\nThat was great, let's listen again! (seed = %u)\r\n\r\n", shuffler.DebugGetSeed());
	while(!shuffler.IsDone())
	{
		if (shuffler.Shuffle(shuffleIndex))
			printf("%s\r\n",g_SongList[shuffleIndex]);
	}

	printf("\r\n");
	system("pause");
	return 0;
}

Example Run

Here’s the output of an example run of this program. Note that if ever you encounter seed 0, it will not shuffle at all. Also, if you encounter seed 15, it will play the list exactly backwards!

[Image: output of an example run]

Something Weird Going On

After playing with this stuff a bit, it looks like even though this technique works “ok”, that it actually doesn’t randomize the list as much as I thought it was. It looks like no matter what my seed is, adjacent numbers seem to “pair up”. Like 1 and 2 will always be next to each other but will change which comes first. Same with 3 and 4, 5 and 6, etc.

I think the problem is that for most of the possible orders those numbers can be in, there doesn’t exist a number you can XOR against the indices to get that order. Even though a simple XOR can re-arrange the numbers, it can’t give you all possible permutations (which makes sense… 16 seeds is a lot less than 10 factorial, which is how many permutations there ought to be!)
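The pairing can be checked directly. Index 2k and index 2k + 1 differ only in their lowest bit, and XORing both with the same seed cannot change that, so their outputs also differ only in the lowest bit. The sketch below (function name hypothetical) brute forces every seed for a given mask to confirm it.

```cpp
// Returns true if, for every seed in [0, mask], each even index and the odd
// index right after it map to two values that differ only in the lowest bit.
// mask is assumed to be a power of 2 minus 1 (15 for the 10 song example).
bool AdjacentPairsAlwaysStayTogether(unsigned int mask)
{
	for (unsigned int seed = 0; seed <= mask; ++seed)
	{
		for (unsigned int index = 0; index < mask; index += 2)
		{
			unsigned int first = index ^ seed;
			unsigned int second = (index + 1) ^ seed;
			if ((first ^ second) != 1u)
				return false;
		}
	}
	return true;
}
```

AdjacentPairsAlwaysStayTogether(15) returns true, so with a plain XOR no seed can ever separate songs 1 and 2, 3 and 4, and so on.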

I have to play around with it some more and think about it a little more though. There might be a way at least to make it better, maybe using some more bits from the random number to do more math operations on the index or something.

Efficiently Generate Random Numbers Without Repeats

Sometimes in game development, you want to get a random number, but you want it to be different than the last random number you got.

For instance, let’s say you were making a game where the player was an elemental golem and could change between ice, fire, electricity and earth randomly but it cost 50 mana to change to a new random element.

When the player presses the button to change forms, if your game rolled a random number to figure out the new element, sometimes it would choose the exact same element that the player already had, but the player would still get charged the 50 mana.

As a player, wouldn’t you feel kind of ripped off if that happened? Wouldn’t it be better if it just chose randomly from all elements except the current one?

I’m going to show you how so if you want to think about it a bit first and see if you can work out a solution, do so now! SPOILERS AHEAD!

Not so Great Implementations

To implement that, there are a couple of “not so great” ways to do it such as…

  • Make a list of all elements except the current one, and randomly choose from that list. This isn’t so great because you would have to allocate space for a list, copy in the elements, and then do the random roll. Memory allocations and data copying aren’t cheap. Imagine if there were 100 or 1000 different things you were choosing between. Then imagine that this operation happened for enemies every few game loops and that there were 1000 enemies on the screen at a time. That would be a LOT of overhead just to roll fresh random numbers!
  • Roll a random number and if it’s the same as the current value, roll it again. Repeat until you get a new number. This isn’t so great because this code will randomly take longer than others. As an extreme case for instance, what if the random number generator chose the same number 100 times in a row? It would take 100 times longer to run the code in that case.
  • Another option might be to roll a random number, and if it was the same number as before, just add one (and use modulus to make sure it’s within valid range). This isn’t so great because you are basically giving the next number higher than your current number twice as much chance of coming up versus any other number. The solution shouldn’t bias the random chance in any way.
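To make the bias in that last option concrete, the sketch below (function name hypothetical) tries every possible roll exactly once and counts how often each value comes out when a repeat is "fixed" by adding one.

```cpp
#include <vector>

// counts[v] is how many of the numSides possible rolls end up as value v.
std::vector<unsigned int> CountAddOneOutcomes(unsigned int numSides, unsigned int currentValue)
{
	std::vector<unsigned int> counts(numSides, 0);
	for (unsigned int roll = 0; roll < numSides; ++roll)
	{
		unsigned int value = roll;
		if (value == currentValue)
			value = (value + 1) % numSides; // bump a repeat to the next value
		counts[value]++;
	}
	return counts;
}
```

With 4 sides and a current value of 1, the counts come out as 1, 0, 2, 1: the value 2 is twice as likely as 0 or 3.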

Solution 1

I faced this problem a few days ago and below is how i solved it. Note that Dice() is zero based, so if you call Dice(6), you will get a random number between 0 and 5. Dice() might be implemented using rand(), or a fast pseudo random number generator like Mersenne Twister or whatever else you would like to use.

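unsigned int DiceNoRepeat(unsigned int numSides, unsigned int currentValue)
{
  // with fewer than 2 sides there is no other value to move to
  if (numSides < 2)
    return 0;

  // roll for one of the numSides - 1 remaining values
  unsigned int newValue = Dice(numSides - 1);

  // skip over the current value
  if (newValue >= currentValue)
    newValue++;

  return newValue;
}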

Why that works is that if you throw out the current value, there are only numSides - 1 numbers left to choose from, so you first roll a number in that smaller range to pick which of the remaining numbers is the new number.

The numbers you chose from didn’t include the current value, but the values you are working with DO include the current value, so if the value is >= the current value, add one.

This solution is nice because it works in constant time (it only calls Dice() once), and also, it doesn’t mess up the probabilities (the chance of getting each number is the same).
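
Here is a minimal sketch for trying this out. Dice() is not defined in this post, so the version below is only one assumed implementation (using std::mt19937 from the standard library). SkipCurrentValue and CountSkipOutcomes are made-up names that pull the “add one if the roll is >= the current value” step out on its own, so every possible roll can be checked without any randomness.

```cpp
#include <random>
#include <vector>

// Assumed implementation of Dice(): returns a value in [0, numSides - 1].
// The post leaves Dice() unspecified; this is just one way to write it.
unsigned int Dice(unsigned int numSides)
{
  if (numSides == 0)
    return 0;
  static std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<unsigned int> dist(0, numSides - 1);
  return dist(rng);
}

// The mapping step on its own: "roll" stands in for Dice(numSides - 1).
unsigned int SkipCurrentValue(unsigned int roll, unsigned int currentValue)
{
  return (roll >= currentValue) ? roll + 1 : roll;
}

// Hypothetical helper: feeds every possible roll through the mapping and
// counts how often each result comes up. Assumes currentValue < numSides.
std::vector<unsigned int> CountSkipOutcomes(unsigned int numSides, unsigned int currentValue)
{
  std::vector<unsigned int> counts(numSides, 0);
  for (unsigned int roll = 0; roll + 1 < numSides; ++roll)
    counts[SkipCurrentValue(roll, currentValue)]++;
  return counts;
}
```

With 6 sides and a current value of 2, the counts come out as 1, 1, 0, 1, 1, 1. The current value never comes up and every other value is hit by exactly one roll, which is why the probabilities stay even.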

Solution 2

Jay, a friend of mine who I used to work with, came up with a different solution that I think is essentially the same in terms of performance, and also has the nice properties of working in constant time and not messing up the probabilities.

unsigned int DiceNoRepeat(unsigned int numSides, unsigned int currentValue)
{
  if (numSides < 1)
    return 0;

  unsigned int offset = Dice(numSides - 1) + 1;

  return (currentValue + offset) % numSides;
}

The reason this works is that instead of picking a random number, you are picking a random offset to add to your current number (wrapping around with modulus). You know that you want to move at least one space, and at most, numSides – 1. Since Dice() is zero based, Dice(numSides – 1) + 1 will give you a random number between 1 and numSides – 1.

When you add the offset to the current value and wrap it around by using modulus against numSides, you get a different random number.
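
The same kind of exhaustive check works for the offset approach. This is only a sketch: ApplyOffset and CountOffsetOutcomes are names made up for this illustration, and it assumes currentValue is less than numSides.

```cpp
#include <vector>

// The wrap-around step on its own: "offset" stands in for Dice(numSides - 1) + 1.
unsigned int ApplyOffset(unsigned int currentValue, unsigned int offset, unsigned int numSides)
{
  return (currentValue + offset) % numSides;
}

// Hypothetical helper: tries every legal offset (1 through numSides - 1) once
// and counts how often each result comes up. Assumes currentValue < numSides.
std::vector<unsigned int> CountOffsetOutcomes(unsigned int numSides, unsigned int currentValue)
{
  std::vector<unsigned int> counts(numSides, 0);
  for (unsigned int offset = 1; offset < numSides; ++offset)
    counts[ApplyOffset(currentValue, offset, numSides)]++;
  return counts;
}
```

With 6 sides and a current value of 4, the offsets 1 through 5 land on 5, 0, 1, 2 and 3, so the counts come out as 1, 1, 1, 1, 0, 1. Since an offset of 0 is never rolled, the current value can’t come up again.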

Have a Different Solution?

Have a different way to do this? Post a comment and share! (:

Is pre-increment really faster than post increment? Part 1

If you are a C++ programmer, I’ll bet at some point in time, maybe during a code review, someone told you “hey you should use a pre-increment there because post-increments are slower”.

I had heard this some years ago in the context of for loops with the caveat of “they might have fixed it by now [the compiler], but why take the chance”. Fair enough I thought, and I started using pre-increment in for loops, but kept using post increment in other places because it felt more natural. Also I don’t write code where the difference actually matters. I try to be explicit instead of clever to prevent bugs and such. For instance, lots of C++ programmers with years of experience probably wouldn’t be 100% sure about what order things happen in for this small code sample: *++p = 3;

To be more explicit, you could increment p on one line and then set *p to 3 on the next. That’s easier to get your head around because it’s more explicit. The optimizer can handle the details of combining the code, and as we will see a little bit later, code generation without an optimizer seems to do just fine too!
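
As a small sketch of that comparison (the function names are made up for illustration), here are the terse and the explicit forms side by side. Both move the pointer forward first and then write through it.

```cpp
// Terse form: p is incremented first, then 3 is written through the new pointer.
void WriteNextTerse(int*& p)
{
  *++p = 3;
}

// Explicit form: same effect, one step per line.
void WriteNextExplicit(int*& p)
{
  ++p;
  *p = 3;
}
```

Starting with p pointing at the first element of an array, either function leaves p pointing at the second element, with that element set to 3 and the first element untouched.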

Anyways, I heard this again today during a code review. Someone saw I had a piece of code where I was incrementing an unsigned int like this (not as part of a for loop, just inside of an if statement): numEnables++;

They said “you should change that to a pre-increment because it’s faster”. Well, I decided I should have a look and see if that was true and that’s where today’s journey begins (:

Disclaimer: All my findings are based on what I’ve seen in MSVC 2010 and 2012 using default settings for debug and release building a win32 console application. When in doubt, check your own disassembly and make sure your code is doing what you think it is. I wish I had years ago so I would have known the truth back then.

Will It Blend?

Let’s check out the assembly that a release build generates for a simple program:

int main(void)
{
  int someNumber = 3;
  someNumber++;
  return 0;
}

Here is the resulting code from the disassembly window. The assembly code of our program is in the blue box:

PS you can find the disassembly window by setting a breakpoint on someNumber++, running the program and when it hits the breakpoint, going under the “Debug” menu, selecting “Windows” and then selecting “Disassembly”.

[Screenshot: prepost1]

Ok so what the heck happened to our program? All it’s doing is xoring the eax register against itself (to set it to zero) and then it’s calling “ret”. That is our “return 0” statement. Everything else got optimized away! The optimizer realized that nothing meaningful changes if it doesn’t calculate someNumber, so it decides not to.

Let’s try a printf to print out our number. That way the optimizer CAN’T optimize our code away.

#include <stdio.h>

int main(void)
{
  int someNumber = 3;
  someNumber++;
  printf("someNumber = %i\r\n", someNumber);
  return 0;
}

Now here’s the disassembly:

[Screenshot: prepost2]

Ok we are doing better I guess. We see a push, a call, an add and then our return 0 code (xor eax, eax and ret).

I included the watch window so that you could see the values that the code is working with. The push pushes our printf format string onto the stack, and then it calls printf. Then it has the line “add esp, 8”. What that does is move the stack pointer up 8 bytes. Parameters are passed on the stack, so that line is there to undo the 4 byte push of the printf format string, and also the 4 byte push of someNumber. It’s just cleaning up the stack.

But where the heck was someNumber pushed onto the stack? I actually have no idea… do you know? If you know, please post a comment, I’m really curious how that works LOL.

EDIT: Thanks to Christophe for unraveling this, he points out that the assembly window was just not showing all the instructions:

For some reason, your MSVC debug window screencap shows the actual main() asm debug starting at 0x10b1002, not 0x10b1000. If you look at the line above at 0x10b0fff, it shows some “add byte ptr [edx+4], ch”. Which is because MSVC incorrectly deduced that there were some code starting at 0x10b0fff and so it messes up the debug print out of the actual code at 0x10b1000 (which would be some “push 4” on the stack, which is the incremented “someNumber” var).

We are not quite there. The optimizer is doing some funky stuff, and I think it’s because the optimizer knows that someNumber is the number “4”. It doesn’t have to do any run time calculations, so it doesn’t.

To make it so the optimizer can’t optimize our number away like that, let’s have the user input someNumber so the compiler has no idea what the right value is until run time.

#include <stdio.h>

int main(void)
{
  int someNumber = 0;
  printf("Please enter a number.\r\n");
  scanf("%i", &someNumber);
  someNumber++;
  printf("someNumber = %i\r\n", someNumber);
  return 0;
}

And here’s the code generated. The important part (our increment) is in the blue box:

[Screenshot: prepost3]

Ok there’s our post increment code. Let’s change the post increment to a preincrement:

#include <stdio.h>

int main(void)
{
  int someNumber = 0;
  printf("Please enter a number.\r\n");
  scanf("%i", &someNumber);
  ++someNumber;
  printf("someNumber = %i\r\n", someNumber);
  return 0;
}

And here’s the code it generates:

[Screenshot: prepost4]

If you compare this generated code to the previously generated code, you’ll see it’s the exact same (some memory addresses have changed, but the instructions are the same). This is pretty good evidence that for this type of usage case, it doesn’t matter if you use post or pre increment – in RELEASE anyways.

Well, you might think to yourself “the optimizer does a good job, but if using a pre-increment, maybe you can make your DEBUG code faster so it isn’t so painful to debug the program”. Good point! It turns out that in debug though, preincrement and post increment give the same generated code. Here it is:

[Screenshots: prepost5a, prepost5b]

So, it looks like in this usage case that the compiler does not care whether you use preincrement or post increment.

Let’s check out for loops real quick before calling it a day.

For Loops

Here’s our post increment for loop testing code. Note we are doing the same tricks as before to prevent the key parts from getting optimized away.

#include <stdio.h>

int main(void)
{
  int someNumber = 0;
  printf("Please enter a number.\r\n");
  scanf("%i", &someNumber);
  for (int index = 0; index < someNumber; index++)
    printf("index = %i\r\n", index);
  return 0;
}

Post and pre-increment generate the same code in release! Here it is:

[Screenshot: prepost6]

It turns out they generate the same code in debug as well:

[Screenshots: prepost7a, prepost7b]

Summary

Well, it looks like for these usage cases (admittedly, they are not very complex code), it really just does not matter. You still ought to check your own usage cases just to make sure you see the results I’m seeing. See something different? Post a comment!
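
If you want a quick way to run that check on your own compiler, one option is a pair of functions that differ only in the increment form, so their generated assembly can be compared side by side. This is just a sketch and the function names are made up. MSVC can write an assembly listing with the /FA option, and GCC and Clang can do it with -S. What the listing looks like depends on your compiler and settings.

```cpp
// Two functions that differ only in the form of the increment. Compile this
// file with an assembly listing turned on and compare the two bodies.
int IncrementPost(int value)
{
  value++;
  return value;
}

int IncrementPre(int value)
{
  ++value;
  return value;
}
```

Both functions return the same value for the same input, so any difference in the listing would come down to how the compiler handled the increment itself.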

In part 2 I’ll show you a case where using pre or post increment does matter. Here it is!

Is pre-increment really faster than post increment? Part 2