Is pre-increment really faster than post increment? Part 1

If you are a C++ programmer, I’ll bet at some point in time, maybe during a code review, someone told you “hey you should use a pre-increment there because post-increments are slower”.

I had heard this some years ago in the context of for loops, with the caveat of “they might have fixed it by now [the compiler], but why take the chance”. Fair enough, I thought, and I started using pre-increment in for loops, but kept using post-increment in other places because it felt more natural. Also, I don’t write code where it would actually matter. I try to be explicit instead of clever to prevent bugs and such. For instance, lots of C++ programmers with years of experience probably wouldn’t be 100% sure about what order things happen in for this small code sample: *++p = 3;

To be more explicit, you could increment p on one line and then set *p to 3 on the next. That’s easier to get your head around. The optimizer can handle the details of combining the code, and as we will see a little bit later, code generation without an optimizer seems to do just fine too!
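Here’s a small sketch of what I mean (assuming p is an int pointer with a valid element to step to):

// clever one-liner: advance p first, then write 3 through it
*++p = 3;

// more explicit version that does the same thing
++p;
*p = 3;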

Anyways, I heard this again today during a code review. Someone saw I had a piece of code where I was incrementing an unsigned int like this (not as part of a for loop, just inside of an if statement): numEnables++;

They said “you should change that to a pre-increment because it’s faster”. Well, I decided I should have a look and see if that was true and that’s where today’s journey begins (:

Disclaimer: All my findings are based on what I’ve seen in MSVC 2010 and 2012 using default settings for debug and release, building a win32 console application. When in doubt, check your own disassembly and make sure your code is doing what you think it is. I wish I had done this years ago so I would have known the truth back then.

Will It Blend?

Let’s check out the assembly generated in release for a simple program:

int main(void)
{
    int someNumber = 3;
    someNumber++;
    return 0;
}

Here is the resulting code from the disassembly window. The assembly code of our program is in the blue box:

PS: you can find the disassembly window by setting a breakpoint on someNumber++, running the program, and when it hits the breakpoint, going under the “Debug” menu, selecting “Windows” and then selecting “Disassembly”.

prepost1

Ok so what the heck happened to our program? All it’s doing is xoring the eax register against itself (to set it to zero) and then it’s calling “ret”. That is our “return 0” statement. Everything else got optimized away! The optimizer realized that nothing meaningful changes if it doesn’t calculate someNumber, so it decides not to.
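In other words, as far as the optimizer is concerned, we might as well have written just this:

int main(void)
{
    return 0;
}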

Let’s try a printf to print out our number. That way the optimizer CAN’T optimize our code away.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int someNumber = 3;
    someNumber++;
    printf("someNumber = %i\r\n", someNumber);
    return 0;
}

Now here’s the disassembly:

prepost2

OK, we are doing better, I guess. We see a push, a call, an add, and then our return 0 code (xor eax, eax and ret).

I included the watch window so that you could see the values that the code is working with. The push pushes our printf format string onto the stack, and then it calls printf. Then it has the line “add esp, 8”. What that does is move the stack pointer up 8 bytes. Parameters are passed on the stack, so that line is there to undo the 4-byte push of the printf format string, and also the 4-byte push of someNumber. It’s just cleaning up the stack.

But where the heck was someNumber pushed onto the stack? I actually have no idea… do you know? If you know, please post a comment, I’m really curious how that works LOL.

EDIT: Thanks to Christophe for unraveling this. He points out that the assembly window was just not showing all the instructions:

For some reason, your MSVC debug window screencap shows the actual main() asm debug starting at 0x10b1002, not 0x10b1000. If you look at the line above at 0x10b0fff, it shows some “add byte ptr [edx+4], ch”. Which is because MSVC incorrectly deduced that there were some code starting at 0x10b0fff and so it messes up the debug print out of the actual code at 0x10b1000 (which would be some “push 4” on the stack, which is the incremented “someNumber” var).

We are not quite there yet. The optimizer is doing some funky stuff, and I think it’s because it knows that someNumber is the number “4”. It doesn’t have to do any run-time calculation, so it doesn’t.
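Said another way, it looks like the increment got folded away at compile time (that’s my read of it, not a dump of the optimizer’s internals), as if we had written:

printf("someNumber = %i\r\n", 4);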

To make it so the optimizer can’t optimize our number away like that, let’s have the user input someNumber so the compiler has no idea what the right value is until run time.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int someNumber = 0;
    printf("Please enter a number.\r\n");
    scanf("%i", &someNumber);
    someNumber++;
    printf("someNumber = %i\r\n", someNumber);
    return 0;
}

And here’s the code generated. The important part (our increment) is in the blue box:

prepost3

OK, there’s our post-increment code. Let’s change the post-increment to a pre-increment:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int someNumber = 0;
    printf("Please enter a number.\r\n");
    scanf("%i", &someNumber);
    ++someNumber;
    printf("someNumber = %i\r\n", someNumber);
    return 0;
}

And here’s the code it generates:

prepost4

If you compare this generated code to the previously generated code, you’ll see it’s exactly the same (some memory addresses have changed, but the instructions are the same). This is pretty good evidence that for this type of usage case, it doesn’t matter whether you use post or pre increment – in RELEASE anyways.
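One way to think about why: for a plain int, the only difference between the two forms is the value the expression itself evaluates to, and here we throw that value away, so there is nothing extra for either form to do. A tiny illustration:

int i = 3;
int a = i++; // post-increment: a gets the old value (3), then i becomes 4
int b = ++i; // pre-increment: i becomes 5 first, b gets the new value (5)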

Well, you might think to yourself “the optimizer does a good job, but by using a pre-increment, maybe you can make your DEBUG code faster so it isn’t so painful to debug the program”. Good point! It turns out that in debug, though, pre-increment and post-increment give the same generated code. Here it is:

prepost5a
prepost5b

So, it looks like in this usage case the compiler does not care whether you use pre-increment or post-increment.

Let’s check out for loops real quick before calling it a day.

For Loops

Here’s our post increment for loop testing code. Note we are doing the same tricks as before to prevent the key parts from getting optimized away.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int someNumber = 0;
    printf("Please enter a number.\r\n");
    scanf("%i", &someNumber);
    for(int index = 0; index < someNumber; index++)
        printf("index = %i\r\n", index);
    return 0;
}
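For reference, the pre-increment version differs only in the loop header; the rest of the program is unchanged:

for(int index = 0; index < someNumber; ++index)
    printf("index = %i\r\n", index);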

Post and pre-increment generate the same code in release! Here it is:

prepost6

It turns out they generate the same code in debug as well:

prepost7a
prepost7b

Summary

Well, it looks like for these usage cases (admittedly, not very complex code), it really just does not matter. You still ought to check your own usage cases just to make sure you see the results I’m seeing. See something different? Post a comment!

In part 2 I’ll show you a case where using pre or post increment does matter. Here it is!

Is pre-increment really faster than post increment? Part 2


6 comments

  1. I think your lead in “If you are a C++ programmer…” hints at where it’s safe to use post increment. If I’m writing C then it’s safe to use post increment as the compiler will do the right thing. But if I’m writing C++ then I know that the compiler can’t be smart enough to do the right thing and it’s up to me to be explicit in what I want.

    I had hoped the C++11’s rvalue references might have allowed the removal of post increment’s penalty but after 30 minutes of searching rvalue and return value optimization, etc, I don’t think it solves the problem.


    • Looks like you already know what’s coming in part 2 then poday hehe (:

      Good thinking about the rvalue/std::move features. It’s a bummer that didn’t bear fruit.

      BTW not sure if you saw, but I finally got what you were talking about mixing sizeof() w/ template stuff. It’s like… why use sizeof() to detect the type when you can use template specialization for the yes/no types right? 😛 Thanks for that


      • Yeah… I was really hoping that I could short circuit your part 2 with a “If you’re using the latest compiler and you can guarantee your code was written by a competent C++11 programmer then it really doesn’t matter if you use post increment.”

        regarding this line in your reply to the previous article:
        “The stuff I’ve seen and talked about so far uses the sizeof trick to determine what function was chosen or which template was chosen due to SFINAE.”
        You could probably use some variation of “typedef decltype(f(0)) type;” to determine function return type.

        Sizeof is a neat trick but several things were added to C++11 to specifically address that shortcoming that the sizeof trick steps around. And there’s always a voice in the back of my mind that says “sizeof(char) could equal sizeof(int), the standard only specifies that char is the smallest addressable element, what min & max values that char and int can store, and that sizeof(char) is less than or equal to sizeof(short) which is less than or equal to sizeof(int)”. There could be some strange machine where the smallest addressable space is also the same size of an integer.


      • Definitely. I’ve seen the better SFINAE / sizeof info on the net will use char[1] and char[2] or that sort of thing, as the yes/no types to avoid those problems.


  2. Pingback: Is pre-increment really faster than post increment? Part 2 | The blog at the bottom of the sea

