Permutation Programming Without Maintenance Nightmares

Example Code: TemplateSlots.cpp

When writing real time software, we sometimes hit situations where we have code that needs to be lightning fast, but also needs to be configurable to change how it behaves.

For instance, if you are writing something that renders 3d graphics (either traditional rasterized 3d graphics, or raytraced 3d graphics) you might be forced to put an if statement in a deep inner loop saying “is this object textured? Does this object have emissive lighting? If so, call the appropriate functions”.  Or, in the case of something like DSP / Audio processing, you might have an if statement saying “do we need to increase the amplitude of our samples?  if so do that”.

Depending on your use case, this can result in tens of thousands of checks per second or more, which can impact performance quite a bit.

This puts us in a pickle between two extremes:

  1. We can decide to live with the the checks to see which functionality is enabled.  The end result is code that is easy to maintain, but it comes at a cost in execution speed.
  2. We can hand craft each permutation of functionality needed, and call the correct function based on parameters in an outer loop. For example, in an outer loop, we call a function like “RenderPolygonNoTextureYesEmissive”.  This means that we copy / paste a lot of code and change small pieces of each function to act like it should for the specific permutation.  This results in form fitted code that can be a lot faster, but will be harder to maintain.

Someone might look at the above and say “#1 makes slower code that is easier to maintain?  Programmers need to stop being lazy and just go with #2 so that it’s faster at runtime”.  When code is easier to maintain though, it means that programmers can spend less time working on bugs and features in this code, freeing them up to work on other things, and also it means that there will be less bugs in the code to begin with.  Making code that’s easier to maintain is a decision that affects team productivity and the business as a whole, so it isn’t something we can very easily dismiss, even at the cost of runtime performance.

Quick aside: there are features in modern CPUs like branch prediction, as well as features in modern compilers / linkers / optimizers that can mitigate some of this stuff, but they can’t always, and if you rely on the behavior of a specific compiler or CPU, you’ll have code maintenance problems when you have to start supporting a different compiler or CPU.  Console and mobile processors are very different beasts than desktop processors for example, which can make relying on assumptions like this a big problem.

One common solution to the code permutation problem is to generate code for all the permutations needed to get the speed of solution #2, but have that code generated based on a few core pieces of code to get the maintainability of solution #1.   People can do this with clever macro usage, or even actually generate code using a custom build step while compiling.

In the graphics world, it’s fairly common that AAA game engines actually have a shader permutation compiler built into the engine to handle this exact problem.  The situation there is a little bit different because shaders aren’t written in C++, but it’s the same basic problem.

I’m going to show a variation to the code generation solution using templates, template parameters, and inline functions.

Before we get into it I want to note two important things:

  • All observations I make about function inlining and optimizations preformed by the compiler were made using Microsoft Visual Studio 2010 using the default settings for a release built console application.  Your mileage may vary!
  • The example code attached to this post works as a makeshift compressor and limiter.  That DSP code is just there as an example of using this technique and this should in no way be used as a lesson on how to write a compressor or limiter.  Many features are missing and major shortcuts have been taken to keep the code simple!

OOPS!

After posting this, a friend pointed out two things to me that I want to share…

  1. Apparently this technique is already well known and used. It’s called “traits” and it’s talked about in the book Modern C++ Design by Andrei Alexandrescu.
  2. The sample code doesn’t seem to compile in llvm or clang. I don’t have easy access to those guys so… sorry about that.

Inline Functions – The Silent Killer

Ok, I’m joking, inline functions aren’t the #1 cause of death for males between the age of 23 and 25.

When people talk about using inline functions, they always seem to only talk about how inline functions remove the overhead of the function call, setting up the stack for the local parameters, the cost of the return, and any object copies that might have happened for the parameters or return value.

A far less talked about feature of inline functions is that it promotes the function code to be a full sibling to the calling code in the eyes of the optimizer.

If your inlined function checks a boolean parameter to take different action whether it’s true or false, and you call that inlined function using a compile time constant for that parameter (ie just pass “true” , not a variable), that means that the optimizer has the opportunity to completely get rid of the branch of code that will never get executed and get rid of the if statement all together.  The optimizer just ditched the inner loop “if check” we talked about earlier that was happening tens of thousands of times a second.

If you call the same inlined function in another place, passing a boolean variable for the parameter, that other location will keep the if statement intact and will act as you would expect.

This is the first step in our battle against code permutations – inline functions can be used to write code once that shrink wraps itself to whatever your specific use cases may be!

This also gives the optimizer the ability to combine various pieces of  code together in clever ways, especially if you call multiple inline functions and it has more code to work with.

Not All Unicorns and Lollipops – Code Bloat and Cache Misses

Inline functions do come at a cost though and are not a magical solution to every problem.  Most notably, inline functions bloat the size of your compiled code.  For most people, code size on disk isn’t really a huge problem because hard drives are huge and loading an exe sized file into ram is not going to be a problem for normal cases.

The problem with increasing code size mainly comes up when dealing with the instruction cache.

Processors have built in cache memory that they can read and write to super quickly, but the cache memory can only hold so much data.  Whenever the processor needs to read something from RAM that isn’t in the cache, it has to load it from RAM into the cache, and then it can read it.  This is called a cache miss and can a major source of performance problems.  Desktop CPUs can get around this problem by doing other things while waiting for the data to come in (http://en.wikipedia.org/wiki/Out-of-order_execution), but other processors like consoles and mobile devices aren’t able to do that, so they just sit there stalled while waiting for the memory to come back.  Very slow!

To minimize cache misses, you should set up your code and data structures so that they don’t have to jump around in memory a lot, which can cause the CPU to have to load and unload the same peices of RAM into the cache needlessly.

Another thing you can do to minimize cache misses is to try to make your data as small as possible by storing less data or using smaller data types.  This way, the CPU is able to hold more meaningful data in it’s cache at a time, which means less RAM access needed.

These techniques also apply to your program itself inside the instruction cache.  If you have code that jumps around a lot (e.g. gotos that jump really far away for those that use gotos, or excessive function calls), that can cause the CPU to have to excessively load different parts of your program from RAM into it’s cache which can be a costly operation.   Also just like with data, if you keep your code smaller, the CPU will be able to hold more meaningful data in the cache at a time and have to hit RAM less.

Lastly, if you ever make a for or while loop that is too large to fit in the instruction cache, that means that you can get a cache miss each time you iterate through the loop!  Making functions inlined, especially if they are called multiple times inside of the loop, can cause the contents of a loop to become larger and cause this problem.

In the end though, if you’d like you can just mark your functions as inline and let the compiler decide if it would like to actually inline them or not since it’s just a hint.  A lot of the times the compiler ought to make decent choices for you, but of course you are free to go down into this rabbit hole as you want.  Since the compiler doesn’t actually understand your algorithm, you can certainly do better than it can in some situations and make better trade offs. In MSVC there is an option to disallow the compiler from going against your wishes for inlining a function if you want to go this route. Other compilers probably have similar options.

The Main Course – Templated Classes With Worker Slots

Now that you see how inline functions can be used to write custom fitting code permutations for you, you can use this to your advantage by having the class or function that does your work take a template parameter for each piece of configurable work that you would like it to do.

You might define a class like this:

template <class EmissiveLighter, class Texturer>
class CRenderer
{
public:
	static void Render(const CSomeDummyParams &params)
	{
		// render
		...
		// now apply emissive lighting and texture
		EmissiveLighter::ApplyEmissive(params);
		Texturer::ApplyTexture(params);
	}
};

Then, you might define “Slot Classes” like the below

//do nothing for emissive lighting and pay no runtime cost for using this
class EmissiveLighter_None
{
public:
	static inline void ApplyEmissive(const CSomeDummyParams &) {}
};

//apply the emissive color defined in the params struct
//Note, we are using templates but not interfaces so the parameter declarations can change to suit our needs
class EmissiveLighter_Params
{
public:
	static inline void ApplyEmissive(CSomeDummyParams &params)
	{
		params.m_outColor += params.m_emissive;
	}
};

Now, you can call the render function and supply what operations you want to use for each slot and it will generate the form fitted permutations for you.

// no emissive lighting, no texture
Render<
	EmissiveLighter_None,
	Texturer_None>
	::Render(params);

// emissive lighting from params, no texture
Render<
	EmissiveLighter_Params,
	Texturer_None>
	::Render(params);

// emissive lighting from params, texture from params
Render<
	EmissiveLighter_Params,
	Texturer_Params>
	::Render(params);

// emissive lighting from params, use procedural noise texture generator
Render<
	EmissiveLighter_Params,
	Texturer_ProceduralNoise>
	::Render(params);

Also, a quick note… templates also cause code bloat like inline functions do, because behind the scenes, it will create a different class for each permutation of template parameters that you actually use.  Just something to be wary of!

Why No Interface Classes?

You might be saying “you know, you could make an interface class with pure virtuals to make sure that your slot classes implemented the required functions in the required ways.  That would increase type safety and such”.

That is definitely true, but in this case I think there is actually good value in NOT making interface classes that the various slot classes derive from.

To see what I mean, check out the difference between the two emissive lighter classes i defined.  One takes the params parameter as a const reference, and the other takes it as a non const reference.  This lets you decide what you want to do with the parameters on an implementation by implementation basis.

This lets you give further optimization hints to the compiler, allowing you to define unused parameters as const refs.

It lets you further “shrink wrap” the code to suit your specific needs in each permutation, and you still get a lot of compile time protection when you use the type in a template.

Last Words

If you go this route, it definitely could make some issues more difficult to debug, especially if you stray from the straight and narrow usage. For instance, you might make the EmissiveLighter_Params set some value on the params that the Texturer_Params uses because Texturer_Params assumes that emissive will always use params too when texture params is used. If you switch the emissive to something else but leave the texturer alone, the texturer could stop working and it could be hard to realize why. My advice if using this technique is to make each slot class as isolated as possible and make as few assumptions as possible about other slots. Also make sure and name slot classes so that it’s very obvious what they do from the name and make sure they don’t do anything other than the name implies. If they do anything else, change the name to reflect that! Don’t worry about not being able to share calculations between slot classes to improve efficiency either… since they ought to all be inlined, the optimizer ought to be able to combine some of that and increase re-use for you when it’s appropriate.

When using this in the real world, you might have to come up with some sort of “multiplexer” function where you pass it your variable parameters and it will return a function pointer to the specific function you ought to use.  A function pointer points at the function in the specific template instantiation asked for, so getting a function pointer back would be enough to translate your parameters into the right template instantiation to use. I’m not sure of a way to do this other than a bunch of embeded switch statements but I’ll be thinking about it and report back or write a new article if I come up with an interesting way to do that. The main problem there comes in mapping run time dynamic values to compile time static values.

Since this technique is using template classes and/or template functions you might be able to do some interesting things with template specialization, partial template specialization and template parameter deduction.

For instance, you could introduce SSE calls if you have to process more than a specific amount of data, but just do simple loops if doing less than that.

I made some example code that goes into this stuff in some more depth and has a few interesting techniques I didn’t go over in the article.

The example source code makes frequent use of casting like “(DataType)0.5” but don’t worry, those ought to be compile time operations for all basic types and have no runtime cost (verified in my compiler anyways)

Example Code: TemplateSlots.cpp


3 comments

  1. I thought that you were going to dive into C++ 11’s template meta programming. In short it’s awesome. By doing a lot of operator overloading in templates and type deductions the compiler can do. Check out http://channel9.msdn.com/Series/C9-Lectures-Stephan-T-Lavavej-Core-C-#tab_sortBy_sequential for some well done talks.

    Something like:
    template struct is_container_helper
    {
    template static std::true_type f(typename U::const_iterator *);
    template static std::false_type f(…);

    typedef decltype(f(0)) type;
    };
    template
    struct is_container : public is_container_helper::type
    {
    };

    Can be used to determine if T has an iterator which then can be passed into another template call which branches passed upon if the value was std::true_type or std::false_type

    such as:
    template
    void SerializeToXML(std::ostream& output, const T& t)
    {
    SerializeToXMLInternal(output, t, typename is_container::type());
    };

    then the SerializeToXMLInternal has two different implementations:
    template
    void SerializeToXMLInternal(std::ostream& output, const T& t, std::true_type /*container*/);
    template
    void SerializeToXMLInternal(std::ostream& output, const T& t, std::false_type /*container*/);

    and the correct function will be called based upon if T has an iterator and then all that extraneous stuff melts away as syntax sugar during compile.

    The real hitch is that you have to know what type you’re iterating over before you enter the loop(s) so that you can use the correct template arguement (and they should all be the same type).

    The downsides: It’s still new tech and while some compilers have recently started stating that they’re C++11 feature complete that doesn’t mean that they all agree on the feature definitions. Also most compilers don’t support all the features yet and they tend to support different sub sets. The template meta programming tends to get really hard to read, there’s a certain “Here be dragons” to writing and reading compile warnings but that would be isolated to where the template type deductions actually occur.

    Like

    • and it looks like all my angle brackets were eaten… Every template declaration was template [typename T] and the SerializeToXMLInternal(output, t, typename is_container::type());
      should be
      SerializeToXMLInternal(output, t, typename is_container[T]::type());

      Like


Leave a comment