Exploring Compile Time Hashing

Never put off til run time what can be done at compile time.

C++11 gave us constexpr which lets us make C++ code that the compiler can run during compilation, instead of at runtime.

This is great because now we can use C++ to do some things that we previously had to use macros or templates for.

As with many of the newish C++ features, it feels like there are some rough edges to work out with constexpr, but it adds a lot of new exciting capabilities.

In this post we will explore some new possibilities, and get a better feel for this new world of compile time code execution. There are going to be some unexpected surprises along the way 😛

Testing Details

The code in this post will be using some compile time crc code I found at Github Gist: oktal/compile-time-crc32.cc. I haven’t tested it for correctness or speed, but it serves the purpose of being a compile time hash implementation that allows us to explore things a bit.

I’ve compiled and analyzed the code in this post in visual studio 2015, in both debug/release and x86/x64. There are differences in behavior between debug and release of course, but x86 and x64 behaved the same. If you have different results with different compilers or different code, please share!

With that out of the way, onto the fun!

We are going to be looking at:

  1. Simple Compile Time Hashing Behavior
  2. Compile Time Hash Switching
  3. Leveraging Jump Tables
  4. Perfect Hashing
  5. Minimally Perfect Hashing
  6. Compile Time Assisted String To Enum

Simple Compile Time Hashing Behavior

Let’s analyze some basic cases of trying to do some compile time hashing

    const char *hello1String = "Hello1";
    unsigned int hashHello1 = crc32(hello1String);  // 1) Always Run Time.
    unsigned int hashHello2 = crc32("Hello2");      // 2) Always Run Time.

    // 3) error C2131: expression did not evaluate to a constant
    //const char *hello3String = "Hello3";
    //constexpr unsigned int hashHello3 = crc32(hello3String);
    constexpr unsigned int hashHello4 = crc32("Hello4");  // 4) Debug: Run Time.  Release: Compile Time

    printf("%X %X %X %X\n", hashHello1, hashHello2, hashHello4, crc32("hello5"));  // 5) Always Run Time. (!!!)

Let’s take a look at the assembly for the above code when compiled in debug. The assembly line calls to crc32 are highlighted for clarity.

    const char *hello1String = "Hello1";
00007FF717B71C3E  lea         rax,[string "Hello1" (07FF717B7B124h)]  
00007FF717B71C45  mov         qword ptr [hello1String],rax  
    unsigned int hashHello1 = crc32(hello1String);  // 1) Always Run Time.
00007FF717B71C49  xor         edx,edx  
00007FF717B71C4B  mov         rcx,qword ptr [hello1String]  
00007FF717B71C4F  call        crc32 (07FF717B710C3h)  
00007FF717B71C54  mov         dword ptr [hashHello1],eax  
    unsigned int hashHello2 = crc32("Hello2");      // 2) Always Run Time.
00007FF717B71C57  xor         edx,edx  
00007FF717B71C59  lea         rcx,[string "Hello2" (07FF717B7B12Ch)]  
00007FF717B71C60  call        crc32 (07FF717B710C3h)  
00007FF717B71C65  mov         dword ptr [hashHello2],eax  

    // 3) error C2131: expression did not evaluate to a constant
    //const char *hello3String = "Hello3";
    //constexpr unsigned int hashHello3 = crc32(hello3String);
    constexpr unsigned int hashHello4 = crc32("Hello4");  // 4) Debug: Run Time.  Release: Compile Time
00007FF717B71C68  xor         edx,edx  
00007FF717B71C6A  lea         rcx,[string "Hello4" (07FF717B7B134h)]  
00007FF717B71C71  call        crc32 (07FF717B710C3h)  
00007FF717B71C76  mov         dword ptr [hashHello4],eax  

    printf("%X %X %X %X\n", hashHello1, hashHello2, hashHello4, crc32("hello5"));  // 5) Always Run Time. (!!!)
00007FF717B71C79  xor         edx,edx  
00007FF717B71C7B  lea         rcx,[string "hello5" (07FF717B7B13Ch)]  
00007FF717B71C82  call        crc32 (07FF717B710C3h)  
00007FF717B71C87  mov         dword ptr [rsp+20h],eax  
00007FF717B71C8B  mov         r9d,0BECA76E1h  
00007FF717B71C91  mov         r8d,dword ptr [hashHello2]  
00007FF717B71C95  mov         edx,dword ptr [hashHello1]  
00007FF717B71C98  lea         rcx,[string "%X %X %X %X\n" (07FF717B7B150h)]  
00007FF717B71C9F  call        printf (07FF717B711FEh)  

As you can see, there is a “call crc32” in the assembly for every place that we call crc32 in the c++ code – 4 crc32 calls in the c++, and 4 crc32 calls in the asm. That means that all of these crc32 calls happen at run time while in debug mode.

I can sort of see the reasoning for always doing constexpr at runtime in debug mode, since you probably want to be able to step through constexpr functions to see how they operate. I’d bet that is the reasoning here.

Let’s see what it compiles to in release. Release is a little bit harder to understand since the optimizations make it difficult/impossible to pair the c++ lines with the asm.

00007FF68DC010BA  lea         rdx,[string "Hello1"+1h (07FF68DC02211h)]  
00007FF68DC010C1  mov         ecx,7807C9A2h  
00007FF68DC010C6  call        crc32_rec (07FF68DC01070h)  
00007FF68DC010CB  lea         rdx,[string "Hello2"+1h (07FF68DC02219h)]  
00007FF68DC010D2  mov         ecx,7807C9A2h  
00007FF68DC010D7  mov         edi,eax  
00007FF68DC010D9  call        crc32_rec (07FF68DC01070h)  
00007FF68DC010DE  lea         rdx,[string "hello5"+1h (07FF68DC02221h)]  
00007FF68DC010E5  mov         ecx,4369E96Ah  
00007FF68DC010EA  mov         ebx,eax  
00007FF68DC010EC  call        crc32_rec (07FF68DC01070h)  
00007FF68DC010F1  mov         r9d,0BECA76E1h  
00007FF68DC010F7  mov         dword ptr [rsp+20h],eax  
00007FF68DC010FB  mov         r8d,ebx  
00007FF68DC010FE  lea         rcx,[string "%X %X %X %X\n" (07FF68DC02228h)]  
00007FF68DC01105  mov         edx,edi  
00007FF68DC01107  call        printf (07FF68DC01010h)  

We can see that in release, there are still 3 calls to crc32 which means that only one hash actually happens at compile time.

From the assembly we can easily see that “Hello1”, “Hello2” and “Hello5” are hashed at runtime. The assembly shows those strings as parameters to the function.

That leaves only “Hello4” remaining, which means that is the one that got hashed at compile time. You can actually see that on line 12, the value 0x0BECA76E1 is being moved into register r9d. If you step through the code in debug mode, you can see that the value of hashHello4 is actually 0x0BECA76E1, so that “move constant into register” on line 12 is the result of our hash happening at compile time. Pretty neat right?

I was actually surprised to see how many hashes remained happening at run time though, especially the one that is a parameter to printf. There really is no reason I can think of why they would need to remain happening at run time, versus happening at compile time, other than (this?) compiler not aggressively moving whatever it can to compile time. I really wish it worked more like that though, and IMO I think it should. Maybe in the future we’ll see compilers move more that direction.

Compile Time Hash Switching

Something neat about being able to do hashing at compile time, is that you can use the result of a compile time hash as a case value in a switch statement!

Let’s explore that a bit:

    unsigned int hash = crc32("Hello1");  // 1) Run Time.
    constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 2) Debug: Run Time. Release: Not calculated at all.
    switch (hash) { // 3) Uses variable on stack
        case hashTestHello2: {  // 4) Compile Time Constant.
            printf("A\n");
            break;
        }
        case crc32("Hello3"): {  // 5) Compile Time Constant.
            printf("B\n");
            break;
        }
        // 6) error C2196: case value '1470747604' already used
        /*
        case crc32("Hello2"): { 
            printf("C\n");
            break;
        }
        */
        default: {
            printf("C\n");
            break;
        }
    }

Something interesting to note is that if you have duplicate cases in your switch statement, due to things hashing to the same value (either duplicates, or actual hash collisions) that you will get a compile time error. This might come in handy, let’s come back to that later.

Let’s look at the assembly code in debug:

    unsigned int hash = crc32("Hello1");  // 1) Run Time.
00007FF77B4119FE  xor         edx,edx  
00007FF77B411A00  lea         rcx,[string "Hello1" (07FF77B41B124h)]  
00007FF77B411A07  call        crc32 (07FF77B4110C3h)  
00007FF77B411A0C  mov         dword ptr [hash],eax  
    constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 2) Debug: Run Time. Release: Not calculated at all.
00007FF77B411A0F  xor         edx,edx  
00007FF77B411A11  lea         rcx,[string "Hello2" (07FF77B41B12Ch)]  
00007FF77B411A18  call        crc32 (07FF77B4110C3h)  
00007FF77B411A1D  mov         dword ptr [hashTestHello2],eax  
    switch (hash) { // 3) Uses variable on stack
00007FF77B411A20  mov         eax,dword ptr [hash]  
00007FF77B411A23  mov         dword ptr [rbp+0F4h],eax  
00007FF77B411A29  cmp         dword ptr [rbp+0F4h],20AEE342h  
00007FF77B411A33  je          Snippet_CompileTimeHashSwitching1+71h (07FF77B411A51h)  
00007FF77B411A35  cmp         dword ptr [rbp+0F4h],57A9D3D4h  
00007FF77B411A3F  je          Snippet_CompileTimeHashSwitching1+63h (07FF77B411A43h)  
00007FF77B411A41  jmp         Snippet_CompileTimeHashSwitching1+7Fh (07FF77B411A5Fh)  
        case hashTestHello2: {  // 4) Compile Time Constant.
            printf("A\n");
00007FF77B411A43  lea         rcx,[string "A\n" (07FF77B41B144h)]  
00007FF77B411A4A  call        printf (07FF77B4111FEh)  
            break;
00007FF77B411A4F  jmp         Snippet_CompileTimeHashSwitching1+8Bh (07FF77B411A6Bh)  
        }
        case crc32("Hello3"): {  // 5) Compile Time Constant.
            printf("B\n");
00007FF77B411A51  lea         rcx,[string "B\n" (07FF77B41B160h)]  
00007FF77B411A58  call        printf (07FF77B4111FEh)  
            break;
00007FF77B411A5D  jmp         Snippet_CompileTimeHashSwitching1+8Bh (07FF77B411A6Bh)  
        }
        // 6) error C2196: case value '1470747604' already used
        /*
        case crc32("Hello2"): { 
            printf("C\n");
            break;
        }
        */
        default: {
            printf("C\n");
00007FF77B411A5F  lea         rcx,[string "C\n" (07FF77B41B164h)]  
00007FF77B411A66  call        printf (07FF77B4111FEh)  
            break;
        }
    }

We can see that the hash for “Hello1” and “Hello2” are both calculated at run time, and that the switch statement uses the stack variable [hash] to move the value into a register to do the switch statement.

Interestingly though, on lines 14 and 16 we can see it moving a constant value into registers to use in a cmp (compare) operation. 0x20AEE342 is the hash value of “Hello3” and 0x57A9D3D4 is the hash value of “Hello2” so it ended up doing those hashes at compile time, even though we are in debug mode. This is because case values must be known at compile time.

It’s interesting to see though that the compiler calculates hashTestHello2 at runtime, even though the only place we use it (in the case statement), it puts a compile time constant from a compile time hash. Odd.

Let’s see what happens in release:

00007FF7FA9B10B4  lea         rdx,[string "Hello1"+1h (07FF7FA9B2211h)]  
00007FF7FA9B10BB  mov         ecx,7807C9A2h  
00007FF7FA9B10C0  call        crc32_rec (07FF7FA9B1070h)  
00007FF7FA9B10C5  cmp         eax,20AEE342h  
00007FF7FA9B10CA  je          main+49h (07FF7FA9B10F9h)  
00007FF7FA9B10CC  cmp         eax,57A9D3D4h  
00007FF7FA9B10D1  je          main+36h (07FF7FA9B10E6h)  
00007FF7FA9B10D3  lea         rcx,[string "C\n" (07FF7FA9B2220h)]  
00007FF7FA9B10DA  call        printf (07FF7FA9B1010h)  
// ... More asm code below, but not relevant

Release is a little more lean which is nice. On line 3 we calculate the hash of “Hello1” at runtime, then on lines 4 and 6, we compare it against the constant values of our compile time hashes for “Hello2” and “Hello3”. That is all that is done at runtime, which is more in line with what we’d like to see the compiler do. It’s still a little bit lame that it didn’t see that “Hello1” was a compile time constant that was being hashed, and did it at runtime, but at least it got most of the hashing to happen at compile time.

What if we change the “hash” variable to be constexpr?

    // Release: this function just prints "C\n" and exits.  All code melted away at compile time!
    constexpr unsigned int hash = crc32("Hello1");  // 1) Debug: Run Time
    constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 2) Debug: Run Time
    switch (hash) { // 3) Debug: Compile Time Constant.
        case hashTestHello2: {   // 4) Debug: Compile Time Constant.
            printf("A\n");
            break;
        }
        case crc32("Hello3"): {   // 5) Debug: Compile Time Constant.
            printf("B\n");
            break;
        }
        default: {
            printf("C\n");
            break;
        }
    }

Let’s check it out in debug first:

constexpr unsigned int hash = crc32("Hello1");  // 1) Debug: Run Time
00007FF71F7A1ABE  xor         edx,edx  
00007FF71F7A1AC0  lea         rcx,[string "Hello1" (07FF71F7AB124h)]  
00007FF71F7A1AC7  call        crc32 (07FF71F7A10C3h)  
00007FF71F7A1ACC  mov         dword ptr [hash],eax  
    constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 2) Debug: Run Time
00007FF71F7A1ACF  xor         edx,edx  
00007FF71F7A1AD1  lea         rcx,[string "Hello2" (07FF71F7AB12Ch)]  
00007FF71F7A1AD8  call        crc32 (07FF71F7A10C3h)  
00007FF71F7A1ADD  mov         dword ptr [hashTestHello2],eax  
    switch (hash) { // 3) Debug: Compile Time Constant.
00007FF71F7A1AE0  mov         dword ptr [rbp+0F4h],0CEA0826Eh  
00007FF71F7A1AEA  cmp         dword ptr [rbp+0F4h],20AEE342h  
00007FF71F7A1AF4  je          Snippet_CompileTimeHashSwitching2+72h (07FF71F7A1B12h)  
00007FF71F7A1AF6  cmp         dword ptr [rbp+0F4h],57A9D3D4h  
00007FF71F7A1B00  je          Snippet_CompileTimeHashSwitching2+64h (07FF71F7A1B04h)  
00007FF71F7A1B02  jmp         Snippet_CompileTimeHashSwitching2+80h (07FF71F7A1B20h)  
        case hashTestHello2: {   // 4) Debug: Compile Time Constant.
            printf("A\n");
00007FF71F7A1B04  lea         rcx,[string "A\n" (07FF71F7AB144h)]  
        case hashTestHello2: {   // 4) Debug: Compile Time Constant.
            printf("A\n");
00007FF71F7A1B0B  call        printf (07FF71F7A11FEh)  
            break;
00007FF71F7A1B10  jmp         Snippet_CompileTimeHashSwitching2+8Ch (07FF71F7A1B2Ch)  
        }
        case crc32("Hello3"): {   // 5) Debug: Compile Time Constant.
            printf("B\n");
00007FF71F7A1B12  lea         rcx,[string "B\n" (07FF71F7AB160h)]  
00007FF71F7A1B19  call        printf (07FF71F7A11FEh)  
            break;
00007FF71F7A1B1E  jmp         Snippet_CompileTimeHashSwitching2+8Ch (07FF71F7A1B2Ch)  
        }
        default: {
            printf("C\n");
00007FF71F7A1B20  lea         rcx,[string "C\n" (07FF71F7AB164h)]  
00007FF71F7A1B27  call        printf (07FF71F7A11FEh)  
            break;
        }
    }

The code does a runtime hash of “Hello1” and “Hello2” on lines 4 and 9. Then, on line 12, it moves the compile time hash value of “Hello1” into memory. On line 13 it compares that against the compile time hash value of “Hello3”. On line 15 it compares it against the compile time hash value of “Hello2”.

Now let’s check release:

00007FF7B7B91074  lea         rcx,[string "C\n" (07FF7B7B92210h)]  
00007FF7B7B9107B  call        printf (07FF7B7B91010h)  

Awesome! It was able to do ALL calculation at compile time, and only do the printf at run time. Neat!

As one final test, let’s see what happens when we put a crc32 call straight into the switch statement.

    constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 1) Debug: Run Time. Release: Not calculated at all.
    switch (crc32("Hello1")) {  // 2) Always Run Time (!!!)
        case hashTestHello2: {  // 3) Compile Time Constant.
            printf("A\n");
            break;
        }
        case crc32("Hello3"): {  // 4) Compile Time Constant.
            printf("B\n");
            break;
        }
        default: {
            printf("C\n");
            break;
        }
    }

Here’s the debug assembly:

    constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 1) Debug: Run Time. Release: Not calculated at all.
00007FF72ED51B7E  xor         edx,edx  
00007FF72ED51B80  lea         rcx,[string "Hello2" (07FF72ED5B12Ch)]  
00007FF72ED51B87  call        crc32 (07FF72ED510C3h)  
00007FF72ED51B8C  mov         dword ptr [hashTestHello2],eax  
    switch (crc32("Hello1")) {  // 2) Always Run Time (!!!)
00007FF72ED51B8F  xor         edx,edx  
00007FF72ED51B91  lea         rcx,[string "Hello1" (07FF72ED5B124h)]  
00007FF72ED51B98  call        crc32 (07FF72ED510C3h)  
00007FF72ED51B9D  mov         dword ptr [rbp+0D4h],eax  
00007FF72ED51BA3  cmp         dword ptr [rbp+0D4h],20AEE342h  
00007FF72ED51BAD  je          Snippet_CompileTimeHashSwitching3+6Bh (07FF72ED51BCBh)  
00007FF72ED51BAF  cmp         dword ptr [rbp+0D4h],57A9D3D4h  
00007FF72ED51BB9  je          Snippet_CompileTimeHashSwitching3+5Dh (07FF72ED51BBDh)  
00007FF72ED51BBB  jmp         Snippet_CompileTimeHashSwitching3+79h (07FF72ED51BD9h)  
        case hashTestHello2: {  // 3) Compile Time Constant.
            printf("A\n");
00007FF72ED51BBD  lea         rcx,[string "A\n" (07FF72ED5B144h)]  
00007FF72ED51BC4  call        printf (07FF72ED511FEh)  
            break;
00007FF72ED51BC9  jmp         Snippet_CompileTimeHashSwitching3+85h (07FF72ED51BE5h)  
        }
        case crc32("Hello3"): {  // 4) Compile Time Constant.
            printf("B\n");
00007FF72ED51BCB  lea         rcx,[string "B\n" (07FF72ED5B160h)]  
00007FF72ED51BD2  call        printf (07FF72ED511FEh)  
            break;
00007FF72ED51BD7  jmp         Snippet_CompileTimeHashSwitching3+85h (07FF72ED51BE5h)  
        }
        default: {
            printf("C\n");
00007FF72ED51BD9  lea         rcx,[string "C\n" (07FF72ED5B164h)]  
00007FF72ED51BE0  call        printf (07FF72ED511FEh)  
            break;
        }
    }

It was able to do the case value crc’s at compile time, but the other two it did at runtime. Not surprising for debug. Let’s check release:

00007FF6A46D10B4  lea         rdx,[string "Hello1"+1h (07FF6A46D2211h)]  
00007FF6A46D10BB  mov         ecx,7807C9A2h  
00007FF6A46D10C0  call        crc32_rec (07FF6A46D1070h)  
00007FF6A46D10C5  cmp         eax,20AEE342h  
00007FF6A46D10CA  je          main+49h (07FF6A46D10F9h)  
00007FF6A46D10CC  cmp         eax,57A9D3D4h  
00007FF6A46D10D1  je          main+36h (07FF6A46D10E6h)  
00007FF6A46D10D3  lea         rcx,[string "C\n" (07FF6A46D2220h)]  
00007FF6A46D10DA  call        printf (07FF6A46D1010h)  
// ... More asm code below, but not relevant

It did the hash for “Hello1” at run time (why??), but it did the others at release. A bit disappointing that it couldn’t make the “Hello1” hash be compile time, but we saw this behavior before so nothing new there.

Leveraging Jump Tables

In the above switch statements, the tests for hashes were very much “If hash is a, do this, else if hash is b, do that”. It was an if/else if/else if style chain.

Switch statements actually have the ability to become jump tables though, which let them get to the right case value with fewer comparisons.

Check out this code:

    // Debug: Jump Table
    // Release: Just does the constant case, everything else goes away
    unsigned int i = 3;
    switch (i) {
        case 0: printf("A\n"); break;
        case 1: printf("B\n"); break;
        case 2: printf("C\n"); break;
        case 3: printf("D\n"); break;
        case 4: printf("E\n"); break;
        case 5: printf("F\n"); break;
        case 6: printf("G\n"); break;
        case 7: printf("H\n"); break;
        default: printf("None\n"); break;
    }

Here’s the debug assembly:

    // Debug: Jump Table
    // Release: Just does the constant case, everything else goes away
    unsigned int i = 3;
00007FF69CF9238E  mov         dword ptr [i],3  
    switch (i) {
00007FF69CF92395  mov         eax,dword ptr [i]  
00007FF69CF92398  mov         dword ptr [rbp+0D4h],eax  
00007FF69CF9239E  cmp         dword ptr [rbp+0D4h],7  
00007FF69CF923A5  ja          $LN11+0Eh (07FF69CF92434h)  
00007FF69CF923AB  mov         eax,dword ptr [rbp+0D4h]  
00007FF69CF923B1  lea         rcx,[__ImageBase (07FF69CF80000h)]  
00007FF69CF923B8  mov         eax,dword ptr [rcx+rax*4+1244Ch]  
00007FF69CF923BF  add         rax,rcx  
00007FF69CF923C2  jmp         rax  
        case 0: printf("A\n"); break;
00007FF69CF923C4  lea         rcx,[string "A\n" (07FF69CF9B144h)]  
00007FF69CF923CB  call        printf (07FF69CF911FEh)  
00007FF69CF923D0  jmp         $LN11+1Ah (07FF69CF92440h)  
        case 1: printf("B\n"); break;
00007FF69CF923D2  lea         rcx,[string "B\n" (07FF69CF9B160h)]  
00007FF69CF923D9  call        printf (07FF69CF911FEh)  
00007FF69CF923DE  jmp         $LN11+1Ah (07FF69CF92440h)  
        case 2: printf("C\n"); break;
00007FF69CF923E0  lea         rcx,[string "C\n" (07FF69CF9B164h)]  
00007FF69CF923E7  call        printf (07FF69CF911FEh)  
00007FF69CF923EC  jmp         $LN11+1Ah (07FF69CF92440h)  
        case 3: printf("D\n"); break;
00007FF69CF923EE  lea         rcx,[string "D\n" (07FF69CF9B168h)]  
00007FF69CF923F5  call        printf (07FF69CF911FEh)  
00007FF69CF923FA  jmp         $LN11+1Ah (07FF69CF92440h)  
        case 4: printf("E\n"); break;
00007FF69CF923FC  lea         rcx,[string "E\n" (07FF69CF9B16Ch)]  
00007FF69CF92403  call        printf (07FF69CF911FEh)  
00007FF69CF92408  jmp         $LN11+1Ah (07FF69CF92440h)  
        case 5: printf("F\n"); break;
00007FF69CF9240A  lea         rcx,[string "F\n" (07FF69CF9B170h)]  
00007FF69CF92411  call        printf (07FF69CF911FEh)  
00007FF69CF92416  jmp         $LN11+1Ah (07FF69CF92440h)  
        case 6: printf("G\n"); break;
00007FF69CF92418  lea         rcx,[string "G\n" (07FF69CF9B174h)]  
00007FF69CF9241F  call        printf (07FF69CF911FEh)  
00007FF69CF92424  jmp         $LN11+1Ah (07FF69CF92440h)  
        case 7: printf("H\n"); break;
00007FF69CF92426  lea         rcx,[string "H\n" (07FF69CF9B178h)]  
00007FF69CF9242D  call        printf (07FF69CF911FEh)  
00007FF69CF92432  jmp         $LN11+1Ah (07FF69CF92440h)  
        default: printf("None\n"); break;
00007FF69CF92434  lea         rcx,[string "None\n" (07FF69CF9AFDCh)]  
00007FF69CF9243B  call        printf (07FF69CF911FEh)  
    }

On line 8 and 9 it checks to see if the value we are switching on is greater than 7, and if so, jumps to 07FF69CF92434h, which is the “default” case where it prints “None”.

If the number is not greater than 7, it calculates an address based on the value we are switching on, and then jumps to it, on line 14.

Instead of testing for every possible value in an if/else if/else if/else if type setup, it can go IMMEDIATELY to the code associated with the particular case statement.

This is a jump table and can be a big speed improvement if you have a lot of cases, or if the code is called a lot.

I’ll spare you the release assembly. The compiler can tell that this is a compile time known result and only has the printf for the “D” without any of the other logic.

Let’s go back to our crc32 code and try and leverage a jump table:

    // Debug / Release: Does hash and jump table.
    // Note: It AND's against 7 and then tests to see if i is greater than 7 (!!!)
    unsigned int i = crc32("Hello") & 7;
    switch (i) {
        case 0: printf("A\n"); break;
        case 1: printf("B\n"); break;
        case 2: printf("C\n"); break;
        case 3: printf("D\n"); break;
        case 4: printf("E\n"); break;
        case 5: printf("F\n"); break;
        case 6: printf("G\n"); break;
        case 7: printf("H\n"); break;
        default: printf("None\n"); break;
    }

I’ll show you just the debug assembly for brevity. It does the hash and jump table at runtime for both debug and release. Interestingly, even though we do &7 on the hash value, the switch statement STILL makes sure that the value being switched on is not greater than 7. It can never be greater than 7, and that can be known at compile time, but it still checks. This is true even of the release assembly!

    // Debug / Release: Does hash and jump table.
    // Note: It AND's against 7 and then tests to see if i is greater than 7 (!!!)
    unsigned int i = crc32("Hello") & 7;
00007FF7439C24CE  xor         edx,edx  
00007FF7439C24D0  lea         rcx,[string "Hello" (07FF7439CB180h)]  
00007FF7439C24D7  call        crc32 (07FF7439C10C3h)  
00007FF7439C24DC  and         eax,7  
00007FF7439C24DF  mov         dword ptr [i],eax  
    switch (i) {
00007FF7439C24E2  mov         eax,dword ptr [i]  
00007FF7439C24E5  mov         dword ptr [rbp+0D4h],eax  
00007FF7439C24EB  cmp         dword ptr [rbp+0D4h],7  
00007FF7439C24F2  ja          $LN11+0Eh (07FF7439C2581h)  
00007FF7439C24F8  mov         eax,dword ptr [rbp+0D4h]  
00007FF7439C24FE  lea         rcx,[__ImageBase (07FF7439B0000h)]  
00007FF7439C2505  mov         eax,dword ptr [rcx+rax*4+12598h]  
00007FF7439C250C  add         rax,rcx  
00007FF7439C250F  jmp         rax  
        case 0: printf("A\n"); break;
00007FF7439C2511  lea         rcx,[string "A\n" (07FF7439CB144h)]  
        case 0: printf("A\n"); break;
00007FF7439C2518  call        printf (07FF7439C11FEh)  
00007FF7439C251D  jmp         $LN11+1Ah (07FF7439C258Dh)  
        case 1: printf("B\n"); break;
00007FF7439C251F  lea         rcx,[string "B\n" (07FF7439CB160h)]  
00007FF7439C2526  call        printf (07FF7439C11FEh)  
00007FF7439C252B  jmp         $LN11+1Ah (07FF7439C258Dh)  
        case 2: printf("C\n"); break;
00007FF7439C252D  lea         rcx,[string "C\n" (07FF7439CB164h)]  
00007FF7439C2534  call        printf (07FF7439C11FEh)  
00007FF7439C2539  jmp         $LN11+1Ah (07FF7439C258Dh)  
        case 3: printf("D\n"); break;
00007FF7439C253B  lea         rcx,[string "D\n" (07FF7439CB168h)]  
00007FF7439C2542  call        printf (07FF7439C11FEh)  
00007FF7439C2547  jmp         $LN11+1Ah (07FF7439C258Dh)  
        case 4: printf("E\n"); break;
00007FF7439C2549  lea         rcx,[string "E\n" (07FF7439CB16Ch)]  
00007FF7439C2550  call        printf (07FF7439C11FEh)  
00007FF7439C2555  jmp         $LN11+1Ah (07FF7439C258Dh)  
        case 5: printf("F\n"); break;
00007FF7439C2557  lea         rcx,[string "F\n" (07FF7439CB170h)]  
00007FF7439C255E  call        printf (07FF7439C11FEh)  
00007FF7439C2563  jmp         $LN11+1Ah (07FF7439C258Dh)  
        case 6: printf("G\n"); break;
00007FF7439C2565  lea         rcx,[string "G\n" (07FF7439CB174h)]  
00007FF7439C256C  call        printf (07FF7439C11FEh)  
00007FF7439C2571  jmp         $LN11+1Ah (07FF7439C258Dh)  
        case 7: printf("H\n"); break;
00007FF7439C2573  lea         rcx,[string "H\n" (07FF7439CB178h)]  
00007FF7439C257A  call        printf (07FF7439C11FEh)  
00007FF7439C257F  jmp         $LN11+1Ah (07FF7439C258Dh)  
        default: printf("None\n"); break;
00007FF7439C2581  lea         rcx,[string "None\n" (07FF7439CAFDCh)]  
00007FF7439C2588  call        printf (07FF7439C11FEh)  
    }

If we make “i” be a constexpr variable, it doesn’t affect debug, but in release, it is able to melt away all the code and just prints the correct case.

00007FF6093F1074  lea         rcx,[string "A\n" (07FF6093F2210h)]  
00007FF6093F107B  call        printf (07FF6093F1010h)  

Perfect Hashing

Perfect hashing is when you have a known set of inputs, and a hash function such that there is no collision between any of the inputs.

Perfect hashing can be great for being able to turn complex objects into IDs for faster lookup times, but the IDs are not usually going to be contiguous. One ID might be 3, and another might be 45361. Perfect hashing can still be useful though.

The code below shows show some compile time perfect hashing could be achieved.

    // Debug: Does have some sort of jump table setup, despite the cases not being continuous.
    // Release: prints "A\n".  All other code melts away at compile time.
    static const unsigned int c_numBuckets = 16;
    static const unsigned int c_salt = 1337;

    constexpr unsigned int i = crc32("Identifier_A", c_salt) % c_numBuckets;
    switch (i) {
        case (crc32("Identifier_A", c_salt) % c_numBuckets): printf("A\n"); break;
        case (crc32("Identifier_B", c_salt) % c_numBuckets): printf("B\n"); break;
        case (crc32("Identifier_C", c_salt) % c_numBuckets): printf("C\n"); break;
        case (crc32("Identifier_D", c_salt) % c_numBuckets): printf("D\n"); break;
        case (crc32("Identifier_E", c_salt) % c_numBuckets): printf("E\n"); break;
        case (crc32("Identifier_F", c_salt) % c_numBuckets): printf("F\n"); break;
        case (crc32("Identifier_G", c_salt) % c_numBuckets): printf("G\n"); break;
        case (crc32("Identifier_H", c_salt) % c_numBuckets): printf("H\n"); break;
        default: printf("None\n"); break;
    }

One nice thing about doing perfect hashing at compile time like this is that if you ever have a hash collision, you’ll have a duplicate case in your switch statement, and will get a compile error. This means that you are guaranteed that your perfect hash is valid at runtime. With non compile time perfect hashes, you could easily get into a situation where you added some more valid inputs and may now have a hash collision, which would make hard to track subtle bugs as two input values would be sharing the same index for whatever read and/or write data they wanted to interact with.

You might notice that i am doing %16 on the hash instead of %8, even though there are only 8 items I want to test against.

The reason for that is hash collisions.

When you hash a string you are effectively getting a pseudo random number back that will always be the same for when that string is given as input.

For the above to work correctly and let me modulus against 8, i would have to roll an 8 sided dice 8 times and get no repeats.

There are 8! (8 factorial) different ways to roll that 8 sided dice 8 times and not get any collisions.

There are 8^8 different ways to roll the 8 sided dice 8 times total.

To get the chance that we roll an 8 sided dice 8 times and get no repeats (no hash collisions), the probability is 8! / 8^8 or 0.24%.

The salt in the crc32 function allows us to effectively re-roll our dice. Each salt value is a different roll of the dice.

How that fits in is that 0.24% of all salt values should let us be able to do %8 on our hashes and not have a hash collision.

Those odds are pretty bad, but they aren’t SUPER bad. We could just brute force search to find a good salt value to use and then hard code the salt in, like i did in the example above.

Unfortunately, the probability I calculated above only works if the hash function gives uniformly distributed output and is “truly random”.

In practice no hash function or pseudo random number generator is, and in fact this crc32 function has NO salt values which make for no collisions! I brute force tested all 2^32 (4.2 billion) possible salt values and came up with no salt that worked!

To get around that problem, instead of trying to get a perfect fit of 8 hashes with values 0-7 without collisions, i opted to go for 8 hashes with values 0-15 with no collisions. That changes my odds for the better, and there are in fact many salts that satisfy that.

It’s the equivalent of rolling a 16 sided dice 8 times without repeats.

Thinking about things a bit differently than before, the first roll has a 100% chance of not being a duplicate. The next roll has a 15/16 chance of not being duplicate. The next has a 14/16 chance and so on until the 8th roll which has a 9/16 chance.

Multiplying those all together, we end up with a 12.08% chance of rolling a 16 sided dice 8 times and not getting a duplicate. That means 12% of the salts (about 1 in 9) won’t produce collisions when we use 16 buckets, which makes it much easier for us to find a salt to use.

Looking at the disassembly in debug, we can see the jump table is miraculously in tact! This is great because now we can get compile time assured perfect hashing of objects, and can use a jump table to convert a runtime object into a perfect hash result.

Note that in release, the code melts away and just does the “correct printf”.

    // Debug: Does have some sort of jump table setup, despite the cases not being continuous.
    // Release: prints "A\n".  All other code melts away at compile time.
    static const unsigned int c_numBuckets = 16;
    static const unsigned int c_salt = 1337;

    constexpr unsigned int i = crc32("Identifier_A", c_salt) % c_numBuckets;
00007FF7CE651E5E  mov         edx,539h  
00007FF7CE651E63  lea         rcx,[string "Identifier_A" (07FF7CE65B190h)]  
00007FF7CE651E6A  call        crc32 (07FF7CE6510C3h)  
00007FF7CE651E6F  xor         edx,edx  
00007FF7CE651E71  mov         ecx,10h  
00007FF7CE651E76  div         eax,ecx  
00007FF7CE651E78  mov         eax,edx  
00007FF7CE651E7A  mov         dword ptr [i],eax  
    switch (i) {
00007FF7CE651E7D  mov         dword ptr [rbp+0D4h],4  
00007FF7CE651E87  cmp         dword ptr [rbp+0D4h],0Eh  
00007FF7CE651E8E  ja          $LN11+0Eh (07FF7CE651F1Dh)  
00007FF7CE651E94  mov         eax,dword ptr [rbp+0D4h]  
00007FF7CE651E9A  lea         rcx,[__ImageBase (07FF7CE640000h)]  
00007FF7CE651EA1  mov         eax,dword ptr [rcx+rax*4+11F34h]  
00007FF7CE651EA8  add         rax,rcx  
00007FF7CE651EAB  jmp         rax  
        case (crc32("Identifier_A", c_salt) % c_numBuckets): printf("A\n"); break;
00007FF7CE651EAD  lea         rcx,[string "A\n" (07FF7CE65B144h)]  
00007FF7CE651EB4  call        printf (07FF7CE6511FEh)  
00007FF7CE651EB9  jmp         $LN11+1Ah (07FF7CE651F29h)  
        case (crc32("Identifier_B", c_salt) % c_numBuckets): printf("B\n"); break;
00007FF7CE651EBB  lea         rcx,[string "B\n" (07FF7CE65B160h)]  
00007FF7CE651EC2  call        printf (07FF7CE6511FEh)  
00007FF7CE651EC7  jmp         $LN11+1Ah (07FF7CE651F29h)  
        case (crc32("Identifier_C", c_salt) % c_numBuckets): printf("C\n"); break;
00007FF7CE651EC9  lea         rcx,[string "C\n" (07FF7CE65B164h)]  
00007FF7CE651ED0  call        printf (07FF7CE6511FEh)  
00007FF7CE651ED5  jmp         $LN11+1Ah (07FF7CE651F29h)  
        case (crc32("Identifier_D", c_salt) % c_numBuckets): printf("D\n"); break;
00007FF7CE651ED7  lea         rcx,[string "D\n" (07FF7CE65B168h)]  
00007FF7CE651EDE  call        printf (07FF7CE6511FEh)  
00007FF7CE651EE3  jmp         $LN11+1Ah (07FF7CE651F29h)  
        case (crc32("Identifier_E", c_salt) % c_numBuckets): printf("E\n"); break;
00007FF7CE651EE5  lea         rcx,[string "E\n" (07FF7CE65B16Ch)]  
00007FF7CE651EEC  call        printf (07FF7CE6511FEh)  
00007FF7CE651EF1  jmp         $LN11+1Ah (07FF7CE651F29h)  
        case (crc32("Identifier_F", c_salt) % c_numBuckets): printf("F\n"); break;
00007FF7CE651EF3  lea         rcx,[string "F\n" (07FF7CE65B170h)]  
00007FF7CE651EFA  call        printf (07FF7CE6511FEh)  
00007FF7CE651EFF  jmp         $LN11+1Ah (07FF7CE651F29h)  
        case (crc32("Identifier_G", c_salt) % c_numBuckets): printf("G\n"); break;
00007FF7CE651F01  lea         rcx,[string "G\n" (07FF7CE65B174h)]  
00007FF7CE651F08  call        printf (07FF7CE6511FEh)  
00007FF7CE651F0D  jmp         $LN11+1Ah (07FF7CE651F29h)  
        case (crc32("Identifier_H", c_salt) % c_numBuckets): printf("H\n"); break;
00007FF7CE651F0F  lea         rcx,[string "H\n" (07FF7CE65B178h)]  
00007FF7CE651F16  call        printf (07FF7CE6511FEh)  
00007FF7CE651F1B  jmp         $LN11+1Ah (07FF7CE651F29h)  
        default: printf("None\n"); break;
00007FF7CE651F1D  lea         rcx,[string "None\n" (07FF7CE65AFDCh)]  
00007FF7CE651F24  call        printf (07FF7CE6511FEh)  
    }

Minimally Perfect Hashing

Minimally perfect hashing is like perfect hashing, except the results are contiguous values.

If you have 8 possible inputs to your minimally perfect hash function, you are going to get as output 0-7. The order of what inputs map to which outputs isn’t strictly defined unless you want to go through a lot of extra effort to make it be that way though.

This is even more useful than perfect hashing, as you can hash a (known good) input and use the result as an index into an array, or similar!

For more info on MPH, check out my post on it: O(1) Data Lookups With Minimal Perfect Hashing

The code below is a way of doing compile time assisted minimally perfect hashing:

    // Debug / Release:
    //   Runs crc32 at runtime only for "i".  The cases are compile time constants as per usual.
    //   Does a jumptable type setup for the switch and does fallthrough to do multiple increments to get the right ID.
    //
    // Release with constexpr on i:
    //   does the printf with a value of 2.  The rest of the code melts away.
    static const unsigned int c_numBuckets = 16;
    static const unsigned int c_salt = 1337;
    static const unsigned int c_invalidID = -1;

    unsigned int i = crc32("Identifier_F", c_salt) % c_numBuckets;
    unsigned int id = c_invalidID;
    switch (i) {
        case (crc32("Identifier_A", c_salt) % c_numBuckets): ++id;
        case (crc32("Identifier_B", c_salt) % c_numBuckets): ++id;
        case (crc32("Identifier_C", c_salt) % c_numBuckets): ++id;
        case (crc32("Identifier_D", c_salt) % c_numBuckets): ++id;
        case (crc32("Identifier_E", c_salt) % c_numBuckets): ++id;
        case (crc32("Identifier_F", c_salt) % c_numBuckets): ++id;
        case (crc32("Identifier_G", c_salt) % c_numBuckets): ++id;
        case (crc32("Identifier_H", c_salt) % c_numBuckets): ++id; 
        // the two lines below are implicit behavior of how this code works
        // break;
        // default: id = c_invalidID; break;
    }

    printf("id = %i\n", id);

Here’s the debug assembly for the above. The release does similar, so i’m not showing it.

    // Debug / Release:
    //   Runs crc32 at runtime only for "i".  The cases are compile time constants as per usual.
    //   Does a jumptable type setup for the switch and does fallthrough to do multiple increments to get the right ID.
    //
    // Release with constexpr on i:
    //   does the printf with a value of 2.  The rest of the code melts away.
    static const unsigned int c_numBuckets = 16;
    static const unsigned int c_salt = 1337;
    static const unsigned int c_invalidID = -1;

    unsigned int i = crc32("Identifier_F", c_salt) % c_numBuckets;
00007FF79E481D0E  mov         edx,539h  
00007FF79E481D13  lea         rcx,[string "Identifier_F" (07FF79E48B1E0h)]  
00007FF79E481D1A  call        crc32 (07FF79E4810C3h)  
00007FF79E481D1F  xor         edx,edx  
00007FF79E481D21  mov         ecx,10h  
00007FF79E481D26  div         eax,ecx  
00007FF79E481D28  mov         eax,edx  
00007FF79E481D2A  mov         dword ptr [i],eax  
    unsigned int id = c_invalidID;
00007FF79E481D2D  mov         dword ptr [id],0FFFFFFFFh  
    switch (i) {
00007FF79E481D34  mov         eax,dword ptr [i]  
00007FF79E481D37  mov         dword ptr [rbp+0F4h],eax  
00007FF79E481D3D  cmp         dword ptr [rbp+0F4h],0Eh  
00007FF79E481D44  ja          $LN11+8h (07FF79E481D9Fh)  
00007FF79E481D46  mov         eax,dword ptr [rbp+0F4h]  
00007FF79E481D4C  lea         rcx,[__ImageBase (07FF79E470000h)]  
00007FF79E481D53  mov         eax,dword ptr [rcx+rax*4+11DB8h]  
00007FF79E481D5A  add         rax,rcx  
00007FF79E481D5D  jmp         rax  
        case (crc32("Identifier_A", c_salt) % c_numBuckets): ++id;
00007FF79E481D5F  mov         eax,dword ptr [id]  
00007FF79E481D62  inc         eax  
00007FF79E481D64  mov         dword ptr [id],eax  
        case (crc32("Identifier_B", c_salt) % c_numBuckets): ++id;
00007FF79E481D67  mov         eax,dword ptr [id]  
00007FF79E481D6A  inc         eax  
00007FF79E481D6C  mov         dword ptr [id],eax  
        case (crc32("Identifier_C", c_salt) % c_numBuckets): ++id;
00007FF79E481D6F  mov         eax,dword ptr [id]  
00007FF79E481D72  inc         eax  
00007FF79E481D74  mov         dword ptr [id],eax  
        case (crc32("Identifier_D", c_salt) % c_numBuckets): ++id;
00007FF79E481D77  mov         eax,dword ptr [id]  
00007FF79E481D7A  inc         eax  
00007FF79E481D7C  mov         dword ptr [id],eax  
        case (crc32("Identifier_E", c_salt) % c_numBuckets): ++id;
00007FF79E481D7F  mov         eax,dword ptr [id]  
00007FF79E481D82  inc         eax  
00007FF79E481D84  mov         dword ptr [id],eax  
        case (crc32("Identifier_F", c_salt) % c_numBuckets): ++id;
00007FF79E481D87  mov         eax,dword ptr [id]  
00007FF79E481D8A  inc         eax  
00007FF79E481D8C  mov         dword ptr [id],eax  
        case (crc32("Identifier_G", c_salt) % c_numBuckets): ++id;
00007FF79E481D8F  mov         eax,dword ptr [id]  
00007FF79E481D92  inc         eax  
00007FF79E481D94  mov         dword ptr [id],eax  
        case (crc32("Identifier_H", c_salt) % c_numBuckets): ++id; 
00007FF79E481D97  mov         eax,dword ptr [id]  
00007FF79E481D9A  inc         eax  
00007FF79E481D9C  mov         dword ptr [id],eax  
        // the two lines below are implicit behavior of how this code works
        // break;
        // default: id = c_invalidID; break;
    }

    printf("id = %i\n", id);
00007FF79E481D9F  mov         edx,dword ptr [id]  
00007FF79E481DA2  lea         rcx,[string "id = %i\n" (07FF79E48B238h)]  
        // the two lines below are implicit behavior of how this code works
        // break;
        // default: id = c_invalidID; break;
    }

    printf("id = %i\n", id);
00007FF79E481DA9  call        printf (07FF79E4811FEh)  

As you can see, the jump table is still in tact, which is good, but it does a lot of repeated increments to get the right ID values. I wish the compiler were smart enough to “flatten” this and just give each case it’s proper ID value.

As is, this could be a performance issue if you had a very large number of inputs.

You could always just hard code a number there instead of relying on the fallthrough and increment, but then there is a lot of copy pasting. Maybe you could do something clever with macros or templates to help that though.

Compile Time Assisted String To Enum

Another interesting thing to think about is that we could actually use compile time hashing to convert a string to an enum.

In this case, let’s say that we don’t know if our input is valid or not. Since we don’t know that, we have to switch on the hash of our input string, but then do a string compare against whatever string has that hash, to make sure it matches. If it does match, it should take that enum value, else it should be invalid.

Since that would be a lot of error prone copy/pasting, I simplified things a bit by using a macro list:

    // Debug / Release:
    //   Runs crc32 at runtime only for "i".  The cases are compile time constants as per usual.
    //   Does a jumptable type setup for the switch and does a string comparison against the correct string.
    //   If strings are equal, sets the enum value.

    static const unsigned int c_numBuckets = 16;
    static const unsigned int c_salt = 1337;

    const char* testString = "Identifier_F";
    unsigned int i = crc32(testString, c_salt) % c_numBuckets;

    // This macro list is used for:
    //  * making the enum
    //  * making the cases in the switch statement
    // D.R.Y. - Don't Repeat Yourself.
    // Fewer moving parts = fewer errors, but admittedly is harder to understand vs redundant code.
    #define ENUM_VALUE_LIST \
        VALUE(Identifier_A) \
        VALUE(Identifier_B) \
        VALUE(Identifier_C) \
        VALUE(Identifier_D) \
        VALUE(Identifier_E) \
        VALUE(Identifier_F) \
        VALUE(Identifier_G) \
        VALUE(Identifier_H) 

    // Make the enum values.
    // Note these enum values are also usable as a contiguous ID if you needed one for an array index or similar.
    // You could define an array with size EIdentifier::count for instance and use these IDs to index into it.
    enum class EIdentifier : unsigned char {
        #define VALUE(x) x,
        ENUM_VALUE_LIST
        #undef VALUE
        count,
        invalid = (unsigned char)-1
    };

    // do a compile time hash assisted string comparison to convert string to enum
    EIdentifier identifier = EIdentifier::invalid;
    switch (i) {
        #define VALUE(x) case (crc32(#x, c_salt) % c_numBuckets) : if(!strcmp(testString, #x)) identifier = EIdentifier::x; else identifier = EIdentifier::invalid; break;
        ENUM_VALUE_LIST
        #undef VALUE
        default: identifier = EIdentifier::invalid;
    }
    
    // undefine the enum value list
    #undef ENUM_VALUE_LIST

    printf("string translated to enum value %i", identifier);

Here’s the debug assembly showing it working like it’s supposed to:

    // Debug / Release:
    //   Runs crc32 at runtime only for "i".  The cases are compile time constants as per usual.
    //   Does a jumptable type setup for the switch and does a string comparison against the correct string.
    //   If strings are equal, sets the enum value.

    static const unsigned int c_numBuckets = 16;
    static const unsigned int c_salt = 1337;

    const char* testString = "Identifier_F";
00007FF6B188179E  lea         rax,[string "Identifier_F" (07FF6B188B1E0h)]  
00007FF6B18817A5  mov         qword ptr [testString],rax  
    unsigned int i = crc32(testString, c_salt) % c_numBuckets;
00007FF6B18817A9  mov         edx,539h  
00007FF6B18817AE  mov         rcx,qword ptr [testString]  
00007FF6B18817B2  call        crc32 (07FF6B18810C3h)  
00007FF6B18817B7  xor         edx,edx  
00007FF6B18817B9  mov         ecx,10h  
00007FF6B18817BE  div         eax,ecx  
00007FF6B18817C0  mov         eax,edx  
00007FF6B18817C2  mov         dword ptr [i],eax  

    // This macro list is used for:
    //  * making the enum
    //  * making the cases in the switch statement
    // D.R.Y. - Don't Repeat Yourself.
    // Fewer moving parts = fewer errors, but admittedly is harder to understand vs redundant code.
    #define ENUM_VALUE_LIST \
        VALUE(Identifier_A) \
        VALUE(Identifier_B) \
        VALUE(Identifier_C) \
        VALUE(Identifier_D) \
        VALUE(Identifier_E) \
        VALUE(Identifier_F) \
        VALUE(Identifier_G) \
        VALUE(Identifier_H) 

    // Make the enum values.
    // Note these enum values are also usable as a contiguous ID if you needed one for an array index or similar.
    // You could define an array with size EIdentifier::count for instance and use these IDs to index into it.
    enum class EIdentifier : unsigned char {
        #define VALUE(x) x,
        ENUM_VALUE_LIST
        #undef VALUE
        count,
        invalid = (unsigned char)-1
    };

    // do a compile time hash assisted string comparison to convert string to enum
    EIdentifier identifier = EIdentifier::invalid;
00007FF6B18817C5  mov         byte ptr [identifier],0FFh  
    switch (i) {
00007FF6B18817C9  mov         eax,dword ptr [i]  
00007FF6B18817CC  mov         dword ptr [rbp+114h],eax  
00007FF6B18817D2  cmp         dword ptr [rbp+114h],0Eh  
    switch (i) {
00007FF6B18817D9  ja          $LN25+20h (07FF6B1881904h)  
00007FF6B18817DF  mov         eax,dword ptr [rbp+114h]  
00007FF6B18817E5  lea         rcx,[__ImageBase (07FF6B1870000h)]  
00007FF6B18817EC  mov         eax,dword ptr [rcx+rax*4+11924h]  
00007FF6B18817F3  add         rax,rcx  
00007FF6B18817F6  jmp         rax  
        #define VALUE(x) case (crc32(#x, c_salt) % c_numBuckets) : if(!strcmp(testString, #x)) identifier = EIdentifier::x; else identifier = EIdentifier::invalid; break;
        ENUM_VALUE_LIST
00007FF6B18817F8  lea         rdx,[string "Identifier_A" (07FF6B188B190h)]  
00007FF6B18817FF  mov         rcx,qword ptr [testString]  
00007FF6B1881803  call        strcmp (07FF6B18811CCh)  
00007FF6B1881808  test        eax,eax  
00007FF6B188180A  jne         Snippet_CompileTimeHashAssistedStringToEnum+92h (07FF6B1881812h)  
00007FF6B188180C  mov         byte ptr [identifier],0  
00007FF6B1881810  jmp         Snippet_CompileTimeHashAssistedStringToEnum+96h (07FF6B1881816h)  
00007FF6B1881812  mov         byte ptr [identifier],0FFh  
00007FF6B1881816  jmp         $LN25+24h (07FF6B1881908h)  
$LN7:
00007FF6B188181B  lea         rdx,[string "Identifier_B" (07FF6B188B1A0h)]  
00007FF6B1881822  mov         rcx,qword ptr [testString]  
00007FF6B1881826  call        strcmp (07FF6B18811CCh)  
00007FF6B188182B  test        eax,eax  
00007FF6B188182D  jne         Snippet_CompileTimeHashAssistedStringToEnum+0B5h (07FF6B1881835h)  
00007FF6B188182F  mov         byte ptr [identifier],1  
00007FF6B1881833  jmp         Snippet_CompileTimeHashAssistedStringToEnum+0B9h (07FF6B1881839h)  
00007FF6B1881835  mov         byte ptr [identifier],0FFh  
00007FF6B1881839  jmp         $LN25+24h (07FF6B1881908h)  
$LN10:
00007FF6B188183E  lea         rdx,[string "Identifier_C" (07FF6B188B1B0h)]  
00007FF6B1881845  mov         rcx,qword ptr [testString]  
00007FF6B1881849  call        strcmp (07FF6B18811CCh)  
00007FF6B188184E  test        eax,eax  
00007FF6B1881850  jne         Snippet_CompileTimeHashAssistedStringToEnum+0D8h (07FF6B1881858h)  
00007FF6B1881852  mov         byte ptr [identifier],2  
00007FF6B1881856  jmp         Snippet_CompileTimeHashAssistedStringToEnum+0DCh (07FF6B188185Ch)  
00007FF6B1881858  mov         byte ptr [identifier],0FFh  
00007FF6B188185C  jmp         $LN25+24h (07FF6B1881908h)  
$LN13:
00007FF6B1881861  lea         rdx,[string "Identifier_D" (07FF6B188B1C0h)]  
00007FF6B1881868  mov         rcx,qword ptr [testString]  
00007FF6B188186C  call        strcmp (07FF6B18811CCh)  
00007FF6B1881871  test        eax,eax  
00007FF6B1881873  jne         Snippet_CompileTimeHashAssistedStringToEnum+0FBh (07FF6B188187Bh)  
00007FF6B1881875  mov         byte ptr [identifier],3  
00007FF6B1881879  jmp         Snippet_CompileTimeHashAssistedStringToEnum+0FFh (07FF6B188187Fh)  
00007FF6B188187B  mov         byte ptr [identifier],0FFh  
00007FF6B188187F  jmp         $LN25+24h (07FF6B1881908h)  
$LN16:
00007FF6B1881884  lea         rdx,[string "Identifier_E" (07FF6B188B1D0h)]  
00007FF6B188188B  mov         rcx,qword ptr [testString]  
00007FF6B188188F  call        strcmp (07FF6B18811CCh)  
00007FF6B1881894  test        eax,eax  
00007FF6B1881896  jne         Snippet_CompileTimeHashAssistedStringToEnum+11Eh (07FF6B188189Eh)  
00007FF6B1881898  mov         byte ptr [identifier],4  
00007FF6B188189C  jmp         Snippet_CompileTimeHashAssistedStringToEnum+122h (07FF6B18818A2h)  
00007FF6B188189E  mov         byte ptr [identifier],0FFh  
00007FF6B18818A2  jmp         $LN25+24h (07FF6B1881908h)  
$LN19:
00007FF6B18818A4  lea         rdx,[string "Identifier_F" (07FF6B188B1E0h)]  
00007FF6B18818AB  mov         rcx,qword ptr [testString]  
00007FF6B18818AF  call        strcmp (07FF6B18811CCh)  
00007FF6B18818B4  test        eax,eax  
00007FF6B18818B6  jne         Snippet_CompileTimeHashAssistedStringToEnum+13Eh (07FF6B18818BEh)  
00007FF6B18818B8  mov         byte ptr [identifier],5  
00007FF6B18818BC  jmp         Snippet_CompileTimeHashAssistedStringToEnum+142h (07FF6B18818C2h)  
00007FF6B18818BE  mov         byte ptr [identifier],0FFh  
00007FF6B18818C2  jmp         $LN25+24h (07FF6B1881908h)  
$LN22:
00007FF6B18818C4  lea         rdx,[string "Identifier_G" (07FF6B188B1F0h)]  
00007FF6B18818CB  mov         rcx,qword ptr [testString]  
00007FF6B18818CF  call        strcmp (07FF6B18811CCh)  
00007FF6B18818D4  test        eax,eax  
00007FF6B18818D6  jne         Snippet_CompileTimeHashAssistedStringToEnum+15Eh (07FF6B18818DEh)  
00007FF6B18818D8  mov         byte ptr [identifier],6  
00007FF6B18818DC  jmp         Snippet_CompileTimeHashAssistedStringToEnum+162h (07FF6B18818E2h)  
00007FF6B18818DE  mov         byte ptr [identifier],0FFh  
00007FF6B18818E2  jmp         $LN25+24h (07FF6B1881908h)  
$LN25:
00007FF6B18818E4  lea         rdx,[string "Identifier_H" (07FF6B188B200h)]  
00007FF6B18818EB  mov         rcx,qword ptr [testString]  
00007FF6B18818EF  call        strcmp (07FF6B18811CCh)  
00007FF6B18818F4  test        eax,eax  
00007FF6B18818F6  jne         $LN25+1Ah (07FF6B18818FEh)  
00007FF6B18818F8  mov         byte ptr [identifier],7  
00007FF6B18818FC  jmp         $LN25+1Eh (07FF6B1881902h)  
00007FF6B18818FE  mov         byte ptr [identifier],0FFh  
00007FF6B1881902  jmp         $LN25+24h (07FF6B1881908h)  
        #undef VALUE
        default: identifier = EIdentifier::invalid;
00007FF6B1881904  mov         byte ptr [identifier],0FFh  
    }
    
    // undefine the enum value list
    #undef ENUM_VALUE_LIST

    printf("string translated to enum value %i", identifier);
00007FF6B1881908  movzx       eax,byte ptr [identifier]  
00007FF6B188190C  mov         edx,eax  
00007FF6B188190E  lea         rcx,[string "string translated to enum value "... (07FF6B188B248h)]  
00007FF6B1881915  call        printf (07FF6B18811FEh)  

Other Possibilities

Besides the above, I think there is a lot of other really great uses for constexpr functions.

For instance, i’d like to see how it’d work to have compile time data structures to do faster read access of constant data.

I want to see compile time trees, hash tables, sparse arrays, bloom filters, and more.

I believe they have the potential to be a lot more performant than even static data structures, since empty or unaccessed sections of the data structure could possibly melt away when the optimizer does it’s pass.

It may not turn out like that, but I think it’d be interesting to investigate it deeper and see how it goes.

I’d also like to see compilers get more aggressive about doing whatever it can at compile time. If there’s no reason for it to happen at runtime, why make it happen then? It is only going to make things slower.

Thanks for reading!

Links

You can find the source code for the code snippets in this post here: Github Atrix256/RandomCode/CompileTimeCRC/

Here’s a couple interesting discussions on constexpr on stack overflow:
detecting execution time of a constexpr function
How to ensure constexpr function never called at runtime?

Comments

comments

Posted in C++

Path Tracing – Getting Started With Diffuse and Emissive

The images below were path traced using 100,000 samples per pixel, using the sample code provided in this post.

Path tracing is a specific type of ray tracing that aims to make photo realistic images by solving something called the rendering equation. Path tracing relies heavily on a method called Monte Carlo integration to get an approximate solution. In fact, it’s often called “Monte Carlo Path Tracing”, but I’ll refer to it as just “Path Tracing”.

In solving the rendering equation, path tracing automatically gets many graphical “features” which are cutting edge research topics of real time graphics, such as soft shadows, global illumination, color bleeding and ambient occlusion.

Unfortunately, path tracing can take a very long time to render – like minutes, hours, or days even, depending on scene complexity, image quality desired and specific algorithms used. Despite that, learning path tracing can still be very useful for people doing real time rendering, as it can give you a “ground truth” to compare your real time approximating algorithms to, and can also help you understand the underlying physical processes involved, to help you make better real time approximations. Sometimes even, data is created offline using path tracing, and is “baked out” so that it can be a quick and simple lookup during runtime.

There is a lot of really good information out there about path tracing, walking you through the rendering equations, monte carlo integration, and the dozen or so relevant statistical topics required to do monte carlo integration.

While understanding that stuff is important if you really want to get the most out of path tracing, this series of blog posts is meant to be more informal, explaining more what to do, and less deeply about why to do it.

When and if you are looking for resources that go deeper into the why, I highly recommend starting with these!

Source Code

You can find the source code that goes along with this post here:

Github: Atrix256/RandomCode/PTBlogPost1/

You can also download a zip of the source code here:

PTBlogPost1.zip

The Rendering Equation

The rendering equation might look a bit scary at first but stay with me, it is actually a lot simpler than it looks.

L_o( \omega_o)= L_e(\omega_o)+\int_{\Omega}{f(\omega_i, \omega_o)L_i(\omega_i)(\omega_i \cdot n)\mathrm{d}\omega_i}

Here’s a simplified version:

LightOut(ViewDirection) = EmissiveLight(ViewDirection) + ReflectedLight(ViewDirection,AllDirections)

In other words, the light you see when looking at an object is made up of how much it glows in your direction (like how a lightbulb or a hot coal in a fireplace glows), and also how much light is reflected in your direction, from light that is hitting that point on the object from all other directions.

It’s pretty easy to know how much an object is glowing in a particular direction. In the sample code, I just let a surface specify how much it glows (an emissive color) and use that for the object’s glow at any point on the surface, at any viewing angle.

The rest of the equation is where it gets harder. The rest of the equation is this:

\int_{\Omega}{f(\omega_i, \omega_o)L_i(\omega_i)(\omega_i \cdot n)\mathrm{d}\omega_i}

The integral (the \int_{\Omega} at the front and the \mathrm{d}\omega_i at the back) just means that we are going to take the result of everything between those two symbols, and add them up for every point in a hemisphere, multiplying each value by the fractional size of the point’s area for the hemisphere. The hemisphere we are talking about is the positive hemisphere surrounding the normal of the surface we are looking at.

One of the reasons things get harder at this point is that there are an infinite number of points on the hemisphere.

Let’s ignore the integral for a second and talk about the rest of the equation. In other words, lets consider only one of the points on the hemisphere for now.

  • f(\omega_i, \omega_o) – This is the “Bidirectional reflectance distribution function”, otherwise known as the BRDF. It describes how much light is reflected towards the view direction, of the light coming in from the point on the hemisphere we are considering.
  • L_i(\omega_i) – This is how much light is coming in from the point on the hemisphere we are considering.
  • (\omega_i \cdot n) – This is the cosine of the angle between the point on the hemisphere we are considering and the surface normal, gotten via a dot product. What this term means is that as the light direction gets more perpendicular to the normal, light is reflected less and less. This is based on the actual behavior of light and you can read more about it here if you want to: Wikipedia: Lambert’s Cosine Law. Here is another great link about it lambertian.pdf. (Thanks for the link Jay!)

So what this means is that for a specific point on the hemisphere, we find out how much light is coming in from that direction, multiply it by how much the BRDF says light should be reflected towards the view direction from that direction, and then apply Lambert’s cosine law to make light dimmer as the light direction gets more perpendicular with the surface normal (more parallel with the surface).

Hopefully that makes enough sense.

Bringing the integral back into the mix, we have to sum up the results of that process for each of the infinite points on the hemisphere, multiplying each value by the fractional size of the point’s area for the hemisphere. When we’ve done that, we have our answer, and have our final pixel color.

This is where Monte Carlo integration comes in. Since we can’t really sum the infinite points, multiplied by their area (which is infinitely small), we are instead going to take a lot of random samples of the hemisphere and average them together. The more samples we take, the closer we get to the actual correct value. Not too hard right?

Now that we have the problem of the infinite points on the hemisphere solved, we actually have a second infinity to deal with.

The light incoming from a specific direction (a point on the hemisphere) is just the amount of light leaving a different object from that direction. So, to find out how much light is coming in from that direction, you have to find what object is in that direction, and calculate how much light is leaving that direction for that object. The problem is, the amount of light leaving that direction for that object is in fact calculated using the rendering equation, so it in turn is based on light leaving a different object and so on. It continues like this, possibly infinitely, and even possibly in loops, where light bounces between objects over and over (like putting two mirrors facing eachother). We have possibly infinite recursion!

The way this is dealt with in path tracers is to just limit the maximum amount of bounces that are considered. Higher numbers of bounces gives diminishing returns in general, so usually it’s just the first couple of bounces that really make a difference. For instance, the images at the top of this post were made using 5 bounces.

The Algorithm

Now that we know how the rendering equation works, and what we need to do to solve it, let’s write up the algorithm that we perform for each pixel on the screen.

  1. First, we calculate the camera ray for the pixel.
  2. If the camera ray doesn’t hit an object, the pixel is black.
  3. If the camera ray does hit an object, the pixel’s color is determined by how much light that object is emitting and reflecting down the camera ray.
  4. To figure out how much light that is, we choose a random direction in the hemisphere of that object’s normal and recurse.
  5. At each stage of the recursion, we return: EmittedLight + 2 * RecursiveLight * Dot(Normal, RandomHemisphereAngle) * SurfaceDiffuseColor.
  6. If we ever reach the maximum number of bounces, we return black for the RecursiveLight value.

We do the above multiple times per pixel and average the results together. The more times we do the process (the more samples we have), the closer the result gets to the correct answer.

By the way, the multiplication by 2 in step five is a byproduct of some math that comes out of integrating the BRDF. Check the links i mentioned at the top of the post for more info, or you can at least verify that I’m not making it up by seeing that wikipedia says the same thing. There is also some nice psuedo code there! (:
Wikipedia: Path Tracing: Algorithm

Calculating Camera Rays

There are many ways to calculate camera rays, but here’s the method I used.

In this post we are going to path trace using a pin hole camera. In a future post we’ll switch to using a lens to get interesting lens effects like depth of field.

To generate rays for our pin hole camera, we’ll need an eye position, a target position that the eye is looking at, and an imaginary grid of pixels between the eye and the target.

This imaginary grid of pixels is what is going to be displayed on the screen, and may be thought of as the “near plane”. If anything gets between the eye and the near plane it won’t be visible.

To calculate a ray for a pixel, we get the direction from the eye to the pixel’s location on the grid and normalize that. That gives us the ray’s direction. The ray’s starting position is just the pixel’s location on that imaginary grid.

I’ll explain how to do this below, but keep in mind that the example code also does this process, in case reading the code is easier than reading a description of the math used.

First we need to figure out the orientation of the camera:

  1. Calculate the camera’s forward direction by normalizing (Target – Eye).
  2. To calculate the camera’s right vector, cross product the forward vector with (0,1,0).
  3. To calculate the camera’s up vector, cross product the forward vector with the right vector.

Note that the above assumes that there is no roll (z axis rotation) on the camera, and that it isn’t looking directly up.

Next we need to figure out the size of our near plane on the camera’s x and y axis. To calculate this, we have to define both a near plane distance (I use 0.1 in the sample code) as well as a horizontal and vertical field of view (I use a FOV of 40 degrees both horizontally and vertically, and make a square image, in the sample code).

You can get the size of the window on each axis then like this:

  1. WindowRight = tangent(HorizontalFOV / 2) * NearDistance
  2. WindowTop = tangent(VerticalFOV / 2) * NearDistance

Note that we divide the field of view by 2 on each axis because we are going to treat the center of the near plane as (0,0). This centers the near plane on the camera.

Next we need to figure out where our pixel’s location is in world space, when it is at pixel location (x,y):

  1. Starting at the camera’s position, move along the camera’s forward vector by whatever your near plane distance is (I use a value of 0.1). This gets us to the center of the imaginary pixel grid.
  2. Next we need to figure out what percentage on the X and Y axis our pixel is. This will tell us what percentage on the near plane it will be. We divide x by ScreenWidth and y by ScreenHeight. We then put these percentages in a [-1,1] space by multiplying the percentages by 2 and subtracting 1. We will call these values u and v, which equate to the x and y axis of the screen.
  3. Starting at the center of the pixel grid that we found, we are going to move along the camera’s right vector a distance of u and the camera’s up vector a distance of v.

We now have our pixel’s location in the world.

Lastly, this is how we calculate the ray’s position and direction:

  1. RayDir = normalize(PixelWorld – Eye)
  2. RayStart = PixelWorld

We now have a ray for our pixel and can start solving eg ray vs triangle equations to see what objects in the scene our ray intersects with.

That’s basically all there is to path tracing at a high level. Next up I want to talk about some practical details of path tracing.

Rendering Parameters, Rendering Quality and Rendering Time

A relevant quote from @CasualEffects:

Below are a few scenes rendered at various samples per pixel (spp) and their corresponding rendering times. They are rendered at 128×128 with 5 bounces. I used 8 threads to utilize my 8 cpu cores. Exact machine specs aren’t really important here, I just want to show how sample count affects render quality and render times in a general way.




There’s a couple things worth noting.

First, render time grows roughly linearly with the number of samples per pixel. This lets you have a pretty good estimate how long a render will take then, if you know how long it took to render 1000 samples per pixel, and now you want to make a 100,000 samples per pixel image – it will take roughly 100 times as long!

Combine that with the fact that you need 4 times as many samples to half the amount of error (noise) in an image and you can start to see why monte carlo path tracing takes so long to render nice looking images.

This also applies to the size of your render. The above were rendered at 128×128. If we instead rendered them at 256×256, the render times would actually be four times as long! This is because our image would have four times as many pixels, which means we’d be taking four times as many samples (at the same number of samples per pixel) to make an image of the same quality at the higher resolution.

You can affect rendering times by adjusting the maximum number of bounces allowed in your path tracer as well, but that is not as straightforward of a relationship to rendering time. The rendering time for a given number of bounces depends on the complexity and geometry of the scene, so is very scene dependent. One thing you can count on though is that decreasing the maximum number of bounces will give you the same or faster rendering times, while increasing the maximum number of bounces will give you the same or slower rendering times.

Something else that’s really important to note is that the first scene takes a lot more samples to start looking reasonable than the second scene does. This is because there is only one small light source in the first image but there is a lot of ambient light from a blue sky in the second image. What this means is that in the first scene, many rays will hit darkness, and only a few will hit light. In the second scene, many rays will hit either the small orb of light, or will hit the blue sky, but all rays will hit a light source.

The third scene also takes a lot more samples to start looking reasonable compared to the fourth scene. This is because in the third scene, there is a smaller, brighter light in the middle of the ceiling, while in the fourth scene there is a dimmer but larger light that covers the entire ceiling. Again, in the third scene, rays are less likely to hit a light source than in the fourth scene.

Why do these scenes converge at different rates?

Well it turns out that the more difference there is in the things your rays are likely to hit (this difference is called variance, and is the source of the “noisy look” in your path tracing), the longer it takes to converge (find the right answer).

This makes a bit of sense if you think of it just from the point of view of taking an average.

If you have a bunch of numbers with an average of N, but the individual numbers vary wildly from one to the next, it will take more numbers averaged together to get closer to the actual average.

If on the other hand, you have a bunch of numbers with an average of N that aren’t very different from eachother (or very different from N), it will take fewer numbers to get closer to the actual average.

Taken to the extreme, if you have a bunch of numbers with an average of N, that are all exactly the value N, it only takes one sample to get to the actual true average of N!

You can read a discussion on this here: Computer Graphics Stack Exchange: Is it expected that a naive path tracer takes many, many samples to converge?

There are lots of different ways of reducing variance of path tracing in both the sampling, as well as in the final image.

Some techniques actually just “de-noise” the final image rendered with lower sample counts. Some techniques use some extra information about each pixel to denoise the image in a smarter way (such as using a Z buffer type setup to do bilateral filtering).

Here’s such a technique that has really impressive results. Make sure and watch the video!

Nonlinearly Weighted First-order Regression for Denoising Monte Carlo Renderings

There is also a nice technique called importance sampling where instead of shooting the rays out in a uniform random way, you actually shoot your rays in directions that matter more and will get you to the actual average value faster. Importance sampling lets you get better results with fewer rays.

In the next post or two, I’ll show a very simple importance sampling technique (cosine weighted hemisphere sampling) and hope in the future to show some more sophisticated importance sampling methods.

Some Other Practical Details

Firstly, I want to mention that this is called “naive” path tracing. There are ways to get better images in less time, and algorithms that are better suited for different scenes or different graphical effects (like fog or transparent objects), but this is the most basic, and most brute force way to do path tracing. We’ll talk about some ways to improve it and get some more features in future posts.

Hitting The Wrong Objects

I wanted to mention that when you hit an object and you calculate a random direction to cast a new ray in, there’s some very real danger that the new ray you cast out will hit the same object you just hit! This is due to the fact that you are starting the ray at the collision point of the object, and the floating point math may or may not consider the new ray to hit the object at time 0 or similar.

One way to deal with this is to move the ray’s starting point a small amount down the ray direction before testing the ray against the scene. If pushed out far enough (I use a distance of 0.001) it will make sure the ray doesn’t hit the same object again. It sounds like a dodgy thing to do, because if you don’t push it out enough (how far is enough?) it will give you the wrong result, and if you push it out too far to be safe, you can miss thin objects, or can miss objects that are very close together. In practice though, this is the usual solution to the problem and works pretty well without too much fuss. Note that this is a problem in all ray tracing, not just path tracing, and this solution of moving the ray by a small epsilon is common in all forms of ray tracing I’ve come across!

Another way to deal with the problem is to give each object a unique ID and then when you cast a ray, tell it to ignore the ID of the object you just hit. This works flawlessly in practice, so long as your objects are convex – which is usually the case for ray tracing since you often use triangles, quads, spheres, boxes and similar to make your scene. However, this falls apart when you are INSIDE of a shape (like how the images at the top of this post show objects INSIDE a box), and it also falls apart when you have transparent objects, since sometimes it is appropriate for an object to be intersected again by a new ray.

I used to be a big fan of object IDs, but am slowly coming to terms with just pushing the ray out a little bit. It’s not so bad, and handles transparents and being inside an object (:

Gamma Correction

After we generate our image, we need to apply gamma correction by taking each color channel to the power of 1/2.2. A decent approximation to that is also to just take the square root of each color channel, as the final value for that color channel. You can read about why we do this here: Understanding Gamma Correction

HDR vs LDR

There is nothing in our path tracer that has any limitations on how bright something can be. We could have a bright green light that had a color value of (0, 100000, 0)! Because of this, the final pixel color may not necessarily be less than one (but it will be a positive number). Our color with be a “High Dynamic Range” color aka HDR color. You can use various tone mapping methods to turn an HDR color into an LDR color – and we will be looking at that in a future post – but for now, I am just clamping the color value between 0 and 1. It’s not the best option, but it works fine for our needs right now.

Divide by Pi?

When looking at explanations or implementations of path tracing, you’ll see that some of them divide colors by pi at various stages, and others don’t. Since proper path tracing is very much about making sure you have every little detail of your math correct, you might wonder whether you should be dividing by pi or not, and if so, where you should do that. The short version is it actually doesn’t really matter believe it or not!

Here are two great reads on the subject for a more in depth answer:
PI or not to PI in game lighting equation
Adopting a physically based shading model

Random Point on Sphere and Bias

Correctly finding a uniformly random point on a sphere or hemisphere is actually a little bit more complicated that you might expect. If you get it wrong, you’ll introduce bias into your images which will make for some strange looking things.

Here is a good read on some ways to do it correctly:
Wolfram Math World: Sphere Point Picking

Here’s an image where the random point in sphere function was just completely wrong:

Here’s an image where the the random hemisphere function worked by picking a random point in a cube and normalizing the resulting vector (and multiplying it by -1 if it was in the wrong hemisphere). That gives too much weighting to the corners, which you can see manifests itself in the image on the left as weird “X” shaped lighting (look at the wall near the green light), instead of on the right where the lighting is circular. Apologies if it’s hard to distinguish. WordPress is converting all my images to 8BPP! :X

Primitive Types & Counts Matter

Here are some render time comparisons of the Cornell box scene rendered at 512×512, using 5 bounces and 100 samples per pixel.

There are 3 boxes which only have 5 sides – the two boxes inside don’t have a bottom, and the box containing the boxes doesn’t have a front.

I started out by making the scene with 30 triangles, since it takes 2 triangles to make a quad, and 5 quads to make a box, and there are 3 boxes.

Those 30 triangles rendered in 21.1 seconds.

I then changed it from 30 triangles to 15 quads.

It then took 6.2 seconds to render. It cut the time in half!

This is not so surprising though. If you look at the code for ray vs triangle compared to ray vs quad, you see that ray vs quad is just a ray vs triangle test were we first test the “diagonal line” of the quad, to see which of the 2 corners should be part of the ray vs triangle test. Because of this, testing a quad is just about as fast as testing a triangle. Since using quads means we have half the number of primitives, turning our triangles into quads means our rendering time is cut in half.

Lastly, i tried turning the two boxes inside into oriented boxes that have a width, height, depth, an axis of rotation and an angle of rotation around that axis. The collision test code puts the ray into the local space of the oriented box and then just does a ray vs axis aligned box test.

Doing that, i ended up with 5 quads (for the box that doesn’t have a front, it needed to stay quads, unless i did back face culling, which i didn’t want to) and two oriented boxes.

The render time for that was 5.5 seconds, so it did shave off 0.7 seconds, which is a little over 11% of the rendering time. So, it was worth while.

For such a low number of primitives, I didn’t bother with any spatial acceleration structures, but people do have quite a bit of luck on more complex scenes with bounding volume hierarchies (BVH’s).

For naive path tracing code, since the first ray hit is entirely deterministic which object it will hit (if any), we could also cache that first intersection and re-use it for each subsequent sample. That ought to make a significant difference to rendering times, but basically in the next path tracing post we’ll be removing the ability to do that, so I didn’t bother to try it and time it.

Debugging

As you are working on your path tracer, it can be useful to render an image at a low number of samples so that it’s quick to render and you can see whether things are turning out the way you want or not.

Another option is to have it show you the image as it’s rendering more and more samples per pixel, so that you can see it working.

If you make a “real time” visualizer like that, some real nice advice from Morgan McGuire (Computer graphics professor and the author of http://graphicscodex.com/) is to make a feature where if you click on a pixel, it re-renders just that pixel, and does so in a single thread so that you can step through the process of rendering that pixel to see what is going wrong.

I personally find a lot of value in visualizing per-pixel values in the image to see how values look across the pixels to be able to spot problems. You can do this by setting the emissive lighting to be based on the value you want to visualize and setting the bounce count to 1, or other similar techniques.

Below are two debug images I made while writing the code for this post to try and understand how some things were going wrong. The first image shows the normals of the surface that the camera rays hit (i put x,y,z of the normal into r,g,b but multiply the values by 0.5 and add 0.5 to make them be 0 to 1, instead of -1 to 1). This image let me see that the normals coming out of my ray vs oriented box test were correct.

The second image shows the number of bounces done per pixel. I divided the bounce count by the maximum number of bounces and used that as a greyscale value for the pixel. This let me see that rays were able to bounce off of oriented boxes. A previous image that I don’t have anymore showed darker sides on the boxes, which meant that the ray bouncing wasn’t working correctly.

Immensely Parallelizable: CPU, GPU, Grid Computing

At an image size of 1024×1024, that is a little over 1 million pixels.

At 1000 samples per pixel, that means a little over 1 billion samples.

Every sample of every pixel only needs one thing to do it’s work: Read only access to the scene.

Since each of those samples are able to do their work independently, if you had 1 billion cores, path tracing could use them all!

The example code is multithreaded and uses as many threads as cores you have on your CPU.

Since GPUs are meant for massively parallelizable work, they can path trace much faster than CPUs.

I haven’t done a proper apples to apples test, but evidence indicates something like a 100x speed up on the GPU vs the CPU.

You can also distribute work across a grid of computers!

One way to do path tracing in grid computing would be to have each machine do a 100 sample image, and then you could average all of those 100 sample images together to get a much higher sample image.

The downside to that is that network bandwidth and disk storage pays the cost of the full size image for each computer you have helping.

A better way to do it would probably be to make each machine do all of the samples for a small portion of the image and then you can put the tiles together at the end.

While decreasing network bandwidth and disk space usage, this also allows you to do all of the pixel averaging in floating point, as opposed to doing it in 8 bit 0-255 values like you’d get from a bmp file made on another machine.

Closing

In this post and the sample code that goes with it, we are only dealing with a purely diffuse (lambertian) surface, and emissive lighting. In future posts we’ll cover a lot more features like specular reflection (mirror type reflection), refraction (transparent objects), caustics, textures, bump mapping and more. We’ll also look at how to make better looking images with fewer samples and lots of other things too.

I actually have to confess that I did a bit of bait and switch. The images at the top of this post were rendered with an anti aliasing technique, as well as an importance sampling technique (cosine weighted hemisphere samples) to make the image look better faster. These things are going to be the topics of my next two path tracing posts, so be on the lookout! Here are the same scenes with the same number of samples, but with no AA, and uniform hemisphere sampling:

And the ones at the top for comparison:

While making this post I received a lot of good information from the twitter graphics community (see the people I’m following and follow them too! @Atrix256) as well as the Computer Graphics Stack Exchange.

Also, two specific individuals helped me out quite a bit:

@lh0xfb – She was also doing a lot of path tracing at the time and helped me understand where some of my stuff was going wrong, including replicating my scenes in her renderer to be able to compare and contrast with. I was sure my renderer was wrong, because it was so noisy! It turned out that while i tended to have small light sources and high contrast scenes, Lauren did a better job of having well lit scenes that converged more quickly.

@Reedbeta – He was a wealth of information for helping me understand some details about why things worked the way they did and answered a handful of graphics stack exchange questions I posted.

Thanks a bunch to you both, and to everyone else that participated in the discussions (:

Comments

comments

Incremental Averaging

This is a super short post about something I want to be able to reference again in the future (:

Let’s say that you need to average N things, where you don’t know what the final N will be, but you want to keep an average as you go.

For instance, let’s say you are doing monte carlo path tracing rendering, where the pixel you are showing is the average of however many samples you’ve had so far, but you are continuing to get new samples and want to show the updated average as you get new samples.

The formula for doing this is super simple.

NewAverage = OldAverage + (NewValue - OldAverage) / NewSampleCount;

// Or:

Average += (NewValue - Average) / NewSampleCount;

One way of thinking of the above equations is that you are adjusting the average by how much the new value would adjust the average.

Another way of thinking of the above two equations is this:

First figure out how far the new sample is from the average, then move towards that new amount by an ever decreasing amount, as the number of samples grow.

Because of this, if you are in a language such as glsl or hlsl which has linear interpolation built in (mix in glsl, lerp in hlsl), you can use linear interpolation as well:

Average = mix(Average, NewValue, 1.0 / NewSampleCount);

To see how this works out, check out the formula for lerp:

Lerp(a,b,t) = a + (b-a)*t

Substituting Average for a, NewValue for b, and 1/NewSampleCount for t, we get this:

Lerp(Average, NewValue, 1/NewSampleCount)
= Average + (NewValue-Average)/NewSampleCount

Which is the exact same formula as above. So, using lerp in that way is mathematically equivalent.

Here’s a link with more discussion on this:
Math Stack Exchange: Incremental averageing

Here’s an awesome post on this same subject, with a different resulting formula, by Christer Ericson (@ChristerEricson), author of the famous “Real Time Collision Detection” book, and VP of technology at Activision. His solution looks like it might be more numerically robust, but I haven’t analyzed it well enough yet to know for sure:
Robustly computing the centroid for a point set

A somewhat related topic, here’s a method for keeping a sum with floating point numbers, that is more accurate than normal summation. Useful when you need to sum numbers of different magnitudes, which would normally be a problem in floating point.
Wikipedia: Kahan Summation

Comments

comments

Understanding The Discrete Fourier Transform

I’ve been working on getting a better understanding of the Discrete Fourier Transform. I’ve figured out some things which have really helped my intuition, and made it a lot simpler in my head, so I wanted to write these up for the benefit of other folks, as well as for my future self when I need a refresher.

The discrete Fourier transform takes in data and gives out the frequencies that the data contains. This is useful if you want to analyze data, but can also be useful if you want to modify the frequencies then use the inverse discrete Fourier transform to generate the frequency modified data.

Multiplying By Sinusoids (Sine / Cosine)

If you had a stream of data that had a single frequency in it at a constant amplitude, and you wanted to know the amplitude of that frequency, how could you figure that out?

One way would be to make a cosine wave of that frequency and multiply each sample of your wave by the corresponding samples.

For instance, let’s say that we have a data stream of four values that represent a 1hz cosine wave: 1, 0, -1, 0.

We could then multiply each point by a corresponding point on a cosine wave, add sum them together:

We get.. 1*1 + 0*0 + -1*-1 + 0*0 = 2.

The result we get is 2, which is twice the amplitude of the data points we had. We can divide by two to get the real amplitude.

To show that this works for any amplitude, here’s the same 1hz cosine wave data with an amplitude of five: 5, 0, -5, 0.

Multiplying by the 1hz cosine wave, we get… 5*1 + 0*0 + -5*-1 + 0*0 = 10.

The actual amplitude is 5, so we can see that our result was still twice the amplitude.

In general, you will need to divide by N / 2 where N is the number of samples you have.

What happens if our stream of data has something other than a cosine wave in it though?

Let’s try a sine wave: 0, 1, 0, -1

When we multiply those values by our cosine wave values we get: 0*1 + 1*0 + 0*-1 + -1*0 = 0.

We got zero. Our method broke!

In this case, if instead of multiplying by cosine, we multiply by sine, we get what we expect: 0*0 + 1*1 + 0*0 + -1*-1 = 2. We get results consistent with before, where our answer is the amplitude times two.

That’s not very useful if we have to know whether we are looking for a sine or cosine wave though. We might even have some other type of wave that is neither one!

The solution to this problem is actually to multiply the data points by both cosine and sine waves, and keep both results.

Let’s see how that works.

For this example We’ll take a cosine wave and shift it 0.25 radians to give us samples: 0.97, -0.25, -0.97, 0.25. (The formula is cos(x*2*pi/4+0.25) from 0 to 3)

When we multiply that by a cosine wave we get: 0.97*1 + -0.25*0 + -0.97*-1 + 0.25*0 = 1.94
When we multiply it by a sine wave we get: 0.97*0 + -0.25*1 + -0.97*0 + 0.25*-1 = -0.5

Since we know both of those numbers are twice the actual amplitude, we can divide them both by two to get:
cosine: 0.97
sine: -0.25

Those numbers kind of makes sense if you think about it. We took a cosine wave and shifted it over a little bit so our combined answer is that it’s mostly a cosine wave, but is heading towards being a sine wave (technically a sine wave that has an amplitude of -1, so is a negative sine wave, but is still a sine wave).

To get the amplitude of our phase shifted wave, we treat those values as a vector, and get the magnitude. sqrt((0.97*0.97)+(-0.25*-0.25)) = 1.00. That is correct! Our wave’s amplitude is 1.0.

To get the starting angle (phase) of the wave, we use inverse tangent of sine over cosine. We want to take atan2(sine, cosine), aka atan2(-0.25, 0.97) which gives us a result of -0.25 radians. That’s the amount that we shifted our wave, so it is also correct!

Formulas So Far

Let’s take a look at where we are at mathematically. For N samples of data, we multiply each sample of data by a sample from a wave with the frequency we are looking for and sum up the results (Note that this is really just a dot product between N dimensional vectors!). We have to do this both with a cosine wave and a sine wave, and keep the two resulting values to be able to get our true amplitude and phase information.

For now, let’s drop the “divide by two” part and look at the equations we have.

Value_{cosine} = \sum\limits_{n=0}^{N-1} Sample_n * Cos(2\pi*Frequency*n/N)

Value_{sine} = \sum\limits_{n=0}^{N-1} Sample_n * Sin(2\pi*Frequency*n/N)

Those equations above do exactly what we worked through in the samples.

The part that might be confusing is the part inside of the Cos() and Sin() so I’ll explain that real quick.

Looking at the graph cos(x) and sin(x) you’ll see that they both make one full cycle between the x values of 0 and 2*pi (aprox. 6.28):

If instead we change it to a graph of cos(2*pi*x) and sin(2*pi*x) you’ll notice that they both make one full cycle between the x values of 0 and 1:

This means that we can think of the wave forms in terms of percent (0 to 1) instead of in radians (0 to 2pi).

Next, we multiply by the frequency we are looking for. We do this because that multiplication makes the wave form repeat that many times between 0 and 1. It makes us a wave of the desired frequency, still remaining in “percent space” where we can go from 0 to 1 to get samples on our cosine and sine waves.

Here’s a graph of cos(2*pi*x*2) and sin(2*pi*x*2), to show some 2hz frequency waves:

Lastly, we multiply by n/N. In our sum, n is our index variable and it goes from 0 to N-1. This is very much like a for loop. n/N is our current percent done we are in the for loop, so when we multiply by n/N, we are just sampling at a different location (by percentage done) on our cosine and sine waves that are at the specified frequencies.

Not too difficult right?

Imaginary Numbers

Wouldn’t it be neat if instead of having to do two separate calculations for our cosine and sine values, we could just do a single calculation that would give us both values?

Well, interestingly, we can! This is where imaginary numbers come in. Don’t get scared, they are here to make things simpler and more convenient, not to make things more complicated or harder to understand (:

There is something called Euler’s Identity which states the below:

e^{ix} = cos(x) + i*sin(x)

That looks pretty useful for our usage case doesn’t it? If you notice that our equations both had the same parameters for cos and sin, that means that we can just use this identity instead!

We can now take our two equations and combine them into a single equation:

Value = \sum\limits_{n=0}^{N-1} Sample_n * e^{i *2\pi * Frequency*n/N}

When we do the above calculation, we will get a complex number out with a real and imaginary part. The real part of the complex number is just the cosine value, while the imaginary part of the complex number is the sine value. Nothing scary has happened, we’ve just used/abused complex numbers a bit to give us what we want in a more compact form.

Multiple Frequencies

Now we know how to look for a single, specific frequency.

What if you want to look for multiple frequencies though? Further more, what if you don’t even know what frequencies to look for?

When doing the discrete Fourier transform on a stream of samples, there are a specific set of frequencies that it checks for. If you have N samples, it only checks for N different frequencies. You could ask about other frequencies, but these frequencies are enough to reconstruct the original signal from only the frequency data.

The first N/2 frequencies are going to be the frequencies 0 (DC) through N/2. N/2 is the Nyquist frequency and is the highest frequency that your signal is able of holding.

After N/2 comes the negative frequencies, counting from a large negative frequency (N/2-1) up to the last frequency which is -1.

As an example, if you had 8 samples and ran the DFT, you’d get 8 complex numbers as outputs, which represent these frequencies:

[0hz, 1hz, 2hz, 3hz, 4hz, -3hz, -2hz, -1hz]

The important take away from this section is that if you have N samples, the DFT asks only about N very specific frequencies, which will give enough information to go from frequency domain back to time domain and get the signal data back.

That then leads to this equation below, where we just put a subscript on Value, to signify which frequency we are probing for.

Value_{Frequency} = \sum\limits_{n=0}^{N-1} Sample_n * e^{2\pi i*Frequency*n/N}

Frequency is an integer, and is between 0 and N-1. We can say that mathematically by listing this information along with the equation:

Frequency \in [0,N), Frequency \in \mathbb Z

Final Form

Math folk use letters and symbols instead of words in their equations.

The letter k is used instead of frequency.

Instead of Value_{Frequency} , the symbol is X_{k} .

Instead of Sample_n , the symbol is x_n.

This gives us a more formal version of the equation:

X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}

k\in [0,N), k \in \mathbb Z

We are close to the final form, but we aren’t quite done yet!

Remember the divide by two? I had us take out the divide by two so that I could re-introduce it now in it’s correct form. Basically, since we are querying for N frequencies, we need to divide each frequency’s result by N. I mentioned that we had to divide by N/2 to make the amplitude information correct, but since we are checking both positive AND negative frequencies, we have to divide by 2*(N/2), or just N. That makes our equation become this:

X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}

k\in [0,N), k \in \mathbb Z

Lastly, we actually want to make the waves go backwards, so we make the exponent negative. The reason for this is a deeper topic than I want to get into right now, but you can read more about why here if you are curious: DFT – Why are the definitions for inverse and forward commonly switched?

That brings us to our final form of the formula for the discrete Fourier transform:

X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{-2\pi ikn/N}

k\in [0,N), k \in \mathbb Z

Taking a Test Drive

If we use that formula to calculate the DFT of a simple, 4 sample, 1hz cosine wave (1, 0, -1, 0), we get as output [0, 0.5, 0, -0.5].

That means the following:

  • 0hz : 0 amplitude
  • 1hz : 0.5 amplitude
  • 2hz : 0 amplitude
  • -1hz : -0.5 amplitude

It is a bit strange that the simple 1hz, 1.0 amplitude cosine wave was split in half, and made into a 1hz cosine wave and -1hz cosine wave, each contributing half amplitude to the result, but that is “just how it works”. I don’t have a very good explanation for this though. My personal understanding just goes that if we are checking for both positive AND negative frequencies, that makes it ambiguous whether it’s a 1hz wave with 1 amplitude, or -1hz wave with -1 amplitude. Since it’s ambiguous, both must be true, and they get half the amplitude each. If I get a better understanding, I’ll update this paragraph, or make a new post on it.

Making The Inverse Discrete Fourier Transform

Making the formula for the Inverse DFT is really simple.

We start with our DFT formula, drop the 1/N from the front, and also make the exponent to e positive instead of negative. That is all! The intuition here for me is that we are just doing the reverse of the DFT process, to do the inverse DFT process.

x_k= \sum\limits_{n=0}^{N-1} X_n * e^{2\pi ikn/N}

k\in [0,N), k \in \mathbb Z

While the DFT takes in real valued signals and gives out complex valued frequency data, the IDFT takes in complex valued frequency data, and gives out real valued signals.

A fun thing to try is to take some data, DFT it, modify the frequency data, then IDFT it to see what comes out the other side.

Other Notes

Here are some other things of note about the discrete Fourier transform and it’s inverse.

Data Repeating Forever

When you run the DFT on a stream of data, the math is such that it assumes that the stream of data you gave it repeats forever both forwards and backwards in time. This is important because if you aren’t careful, you can add frequency content to your data that you didn’t intend to be there.

For instance, if you tile a 1hz sine wave, it’s continuous, and there is only the 1hz frequency present:

However, if you tile a 0.9hz sine wave, there is a discontinuity, which means that there will be other, unintended frequencies present when you do a DFT of the data, to be able to re-create the discontinuity:

Fast Fourier Transform

There are a group of algorithms called “Fast Fourier Transforms”. You might notice that if we have N samples, taking the DFT is an O(N^2) operation. Fast Fourier transforms can bring it down to O(N log N).

DFT / IDFT Formula Variations

The formula we came up with is one possible DFT formula, but there are a handful of variations that are acceptable, even though different variations come up with different values!

The first option is whether to make the e exponent negative or not. Check out these two formulas to see what I mean.

X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}
vs
X_k= \sum\limits_{n=0}^{N-1} x_n * e^{-2\pi ikn/N}

Either one is acceptable, but when providing DFT’d data, you should mention which one you did, and make sure and use the opposite one in your inverse DFT formula.

The next options is whether to divide by N or not, like the below:

X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}
vs
X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}

Again, either one is acceptable, but you need to make sure and let people know which one you did, and also make sure and use the opposite one in your inverse DFT formula.

Instead of dividing by N, some people actually divide by 1/sqrt(N) in both the DFT and inverse DFT. Wolfram alpha does this for instance!

You can read more about this stuff here: DFT – Why are the definitions for inverse and forward commonly switched?

One thing to note though is that if doing 1/N on the DFT (my personal preference), the 0hz (DC) frequency bin gives you the average value of the signal, and the amplitude data you get out of the other bins is actually correct (keeping in mind the amplitudes are split in half between positive and negative frequencies).

Why Calculate Negative Frequencies

The negative frequencies are able to be calculated on demand from the positive frequency information (eg complex conjugation), so why should we even bother calculating them? Sure it’d be more computationally efficient not to calculate them, especially when just doing frequency analysis, right?!

Here’s a discussion about that: Why calculate negative frequencies of DFT?

Higher Dimensions

It’s possible to DFT in 2d, 3d, and higher. The last blog post shows how to do this with 2d images, but I’d also like to write a blog post like this one specifically talking about the intuition behind multi dimensional DFTs.

Example Program Source Code

Here’s some simple C++ source code which calculates the DFT and inverse DFT, optionally showing work in case you want to try to work some out by hand to better understand this. Working a few out by hand really helped me get a better intuition for all this stuff.

Program Output:

Source Code:

#include <stdio.h>
#include <complex>
#include <vector>

// set to 1 to have it show you the steps performed.  Set to 0 to hide the work.
// useful if checking work calculated by hand.
#define SHOW_WORK 1

#if SHOW_WORK
    #define PRINT_WORK(...) printf(__VA_ARGS__)
#else
    #define PRINT_WORK(...)
#endif

#define CHAR_SIGMA 228
#define CHAR_PI 227

typedef float TRealType;
typedef std::complex<TRealType> TComplexType;

const TRealType c_pi = (TRealType)3.14159265359;
const TRealType c_twoPi = (TRealType)2.0 * c_pi;

//=================================================================================
TComplexType DFTSample (const std::vector<TRealType>& samples, int k)
{
    size_t N = samples.size();
    TComplexType ret;
    for (size_t n = 0; n < N; ++n)
    {
        TComplexType calc = TComplexType(samples[n], 0.0f) * std::polar<TRealType>(1.0f, -c_twoPi * TRealType(k) * TRealType(n) / TRealType(N));
        PRINT_WORK("    n = %i : (%f, %f)\n", n, calc.real(), calc.imag());
        ret += calc;
    }
    ret /= TRealType(N);
    PRINT_WORK("    Sum the above and divide by %i\n", N);
    return ret;
}

//=================================================================================
std::vector<TComplexType> DFTSamples (const std::vector<TRealType>& samples)
{
    PRINT_WORK("DFT:  X_k = 1/N %cn[0,N) x_k * e^(-2%cikn/N)\n", CHAR_SIGMA, CHAR_PI);

    size_t N = samples.size();
    std::vector<TComplexType> ret;
    ret.resize(N);
    for (size_t k = 0; k < N; ++k)
    {
        PRINT_WORK("  k = %i\n", k);
        ret[k] = DFTSample(samples, k);
        PRINT_WORK("  X_%i = (%f, %f)\n", k, ret[k].real(), ret[k].imag());
    }
    PRINT_WORK("\n");
    return ret;
}

//=================================================================================
TRealType IDFTSample (const std::vector<TComplexType>& samples, int k)
{
    size_t N = samples.size();
    TComplexType ret;
    for (size_t n = 0; n < N; ++n)
    {
        TComplexType calc = samples[n] * std::polar<TRealType>(1.0f, c_twoPi * TRealType(k) * TRealType(n) / TRealType(N));
        PRINT_WORK("    n = %i : (%f, %f)\n", n, calc.real(), calc.imag());
        ret += calc;
    }
    PRINT_WORK("    Sum the above and take the real component\n");
    return ret.real();
}

//=================================================================================
std::vector<TRealType> IDFTSamples (const std::vector<TComplexType>& samples)
{
    PRINT_WORK("IDFT:  x_k = %cn[0,N) X_k * e^(2%cikn/N)\n", CHAR_SIGMA, CHAR_PI);

    size_t N = samples.size();
    std::vector<TRealType> ret;
    ret.resize(N);
    for (size_t k = 0; k < N; ++k)
    {
        PRINT_WORK("  k = %i\n", k);
        ret[k] = IDFTSample(samples, k);
        PRINT_WORK("  x_%i = %f\n", k, ret[k]);
    }
    PRINT_WORK("\n");
    return ret;
}

//=================================================================================
template<typename LAMBDA>
std::vector<TRealType> GenerateSamples (int numSamples, const LAMBDA& lambda)
{
    std::vector<TRealType> ret;
    ret.resize(numSamples);
    for (int i = 0; i < numSamples; ++i)
    {
        TRealType percent = TRealType(i) / TRealType(numSamples);
        ret[i] = lambda(percent);
    }
    return ret;
}

//=================================================================================
int main (int argc, char **argv)
{
    // You can test specific data samples like this:
    //std::vector<TRealType> sourceData = { 1, 0, 1, 0 }; 
    //std::vector<TRealType> sourceData = { 1, -1, 1, -1 };

    // Or you can generate data samples from a function like this
    std::vector<TRealType> sourceData = GenerateSamples(
        4,
        [] (TRealType percent)
        {
            const TRealType c_frequency = TRealType(1.0);
            return cos(percent * c_twoPi * c_frequency);
        }
    );

    // Show the source data
    printf("\nSource = [ ");
    for (TRealType v : sourceData)
        printf("%f ",v);
    printf("]\n\n");

    // Do a dft and show the results
    std::vector<TComplexType> dft = DFTSamples(sourceData);
    printf("dft = [ ");
    for (TComplexType v : dft)
        printf("(%f, %f) ", v.real(), v.imag());
    printf("]\n\n");

    // Do an inverse dft of the dft data, and show the results
    std::vector<TRealType> idft = IDFTSamples(dft);
    printf("idft = [ ");
    for (TRealType v : idft)
        printf("%f ", v);
    printf("]\n");
    
    return 0;
}

Links

Explaining how to calculate the frequencies represented by the bins of output of DFT:
How do I obtain the frequencies of each value in an FFT?

Another good explanation of the Fourier transform if it isn’t quite sinking in yet:
An Interactive Guide To The Fourier Transform

Some nice dft calculators that also have inverse dft equivelants:
DFT Calculator 1
IDFT Calculator 1
DFT Calculator 2
IDFT Calculator 2

Wolfram alpha can also do DFT and IDFT, but keep in mind that the formula used there is different and divides the results by 1/sqrt(N) in both DFT and IDFT so will be different values than you will get if you use a different formula.

Wolfram Alpha: Fourier[{1, 0, -1, 0}] = [0,1,0,1]

Wolfram Alpha: Inverse Fourier[{0, 1 , 0, 1}] = [1, 0, -1, 0]

Comments

comments

Fourier Transform (And Inverse) Of Images

I attended SIGGRAPH this year and there were some amazing talks.

There were quite a few talks dealing with the Fourier transform of images and sampling patterns, signal frequencies and bandwidth so I feel compelled to write up a blog post about the Fourier transform and inverse Fourier transform of images, as a transition to some other things that I want to write up.

At the bottom of this post is the source code to the program i used to make the examples. It’s a single CPP file that does not link to any libraries, and does not include any non standard headers (besides windows.h). It should be super simple to copy, paste, compile and use!

Fourier Transform Overview

The Fourier transform converts data into the frequencies of sine and cosine waves that make up that data. Since we are going to be dealing with sampled data (pixels), we are going to be using the discrete Fourier transform.

After you perform the Fourier transform, you can run the inverse Fourier transform to get the original image back out.

You can also optionally modify the frequency data before running the inverse Fourier transform, which would give you an altered image as output.

In audio, a fourier transform is 1D, while with images, it’s 2D. That slows things down because a 1D Fourier transform is O(N^2) while a 2D Fourier transform is O(N^4) .

This is quite an expensive operation as you can see, but there are some things that can mitigate the issue:

  • The operation is separable on each axis. AKA only the naive implementation is O(N^4) .
  • There is something called “The Fast Fourier Transform” which can make a 1D fourier transform go from O(N^2) to O(N log N) time complexity.
  • Each pixel of output can be calculated without consideration of the other output pixels. The algorithm only needs read access to the source image or source data. This means that you can run this across however many cores you have on your CPU or GPU.

The items above are true of both the Fourier transform as well as the inverse Fourier transform.

The 2D Fourier transform takes a grid of REAL values as input of size MxN and returns a grid of COMPLEX values as output, also of size MxN.

The inverse 2D Fourier transform takes a grid of COMPLEX values as input, of size MxN and returns a grid of REAL values as output, also of size MxN.

The complex values are (of course!) made up of two components. You can get the amplitude of the frequency represented by the complex value by treating these components as a vector and getting the length. You can get the phase (angle that the frequency starts at) of the frequency by treating it like a vector and getting the angle it represents – like by using atan2(imaginary, real).

For more detailed information about the Fourier transform or the inverse Fourier transform, including the mathematical equations, please see the links at the end of this post!

Image Examples

I’m going to show some examples of images that have been Fourier transformed, modified, and inverse Fourier transformed back. This should hopefully give you a more intuitive idea of what this stuff is all about.

I’m working with the source images in grey scale so i only have to work with one color channel (more on how to do that here: Blog.Demofox.Org: Converting RGB to Grayscale). You could easily do this same stuff with color images, but you would need to work with each color channel individually.

Zelda Guy

Here is the old man from “The Legend of Zelda” who gives you the sword. This image is 84×80 which takes about 1.75 seconds to do a fourier or inverse fourier transform with my naiive implementation of unoptimized code.

Taking a fourier transform of the greyscale version of that image gives me the following frequency amplitude (first) and phase (second) information. Note that we put the frequency amplitude through a log function to make the lesser represented frequencies show up more visibly.


Note that the center of the image represents frequency 0, aka DC. As you move out from the center, you get to higher and higher frequencies.

If you put that information into the inverse Fourier transform, you get the original image back out (in greyscale of course):

What if we changed the phase information though? Here’s what it looks like if we set all the frequencies to start at phase (angle) 0 instead of the proper angles, and then do an inverse Fourier transform:

It came out to be a completely different image! It has all the right frequencies, but the image is completely unrecognizable due to us messing with the phase data.

Interestingly, while your eyes are good at noticing differences in phase, your ears are not. That means that if this was a sound, instead of an image, you wouldn’t even be able to tell the difference. Strange isn’t it?

Now let’s do a low pass filter to our data. In other words, we are going to zero out the amplitude of all frequencies that are above a certain amount. We are going to zero out the frequencies that are farther than 10% of the image diagonal radius. That makes our frequency information look like this:

If we run the inverse Fourier transform on it, we get this:

The image got blurier because the high frequencies were removed. The high frequencies represent the small details of the image.

We also got some “ringing artifacts” which are the things that look like halos around the old man. This is also due to removing high frequency details. The short explanation for this is that it is very difficult to add sinusoids of different amplitudes and frequencies together to make a flat surface. To do so, you need a lot of small high frequency waves to fill in the areas next to the round hum humps to flatten it out. It’s the same issue you see when trying to make a square wave with additive synthesis, if you’ve read any of my posts on audio synthesis.

Now let’s try a high pass filter, where we remove frequencies that are closer than 10% of the image diagonal radius. First is the frequency amplitude information, and then the resulting image:


The results look pretty strange. These are the high frequency details that the blurry image is missing!

You could use a high pass filter on an image to do edge detection. You could use a low pass filter on an image to remove high frequency details before making the image smaller to prevent the aliasing that can happen when making an image smaller.

Let’s look at some other images that have been given similar treatment.

SIGGRAPH

Here’s a picture of myself at SIGGRAPH with my friend Paul who I used to work with at inXile! The image is 100×133 and takes about 6.5-7 seconds to do a Fourier transform or an inverse Fourier transform.

Here is the Fourier transform and inverse Fourier transform:



Here is the low pass frequency info and inverse Fourier transform:


Here is the high pass:


And here is the zero phase:

Simple Images

Lastly, here are some simple images, along with their frequency magnitude and phases. Sorry that they are so small, but hopefully you get the idea.

Horizontal Stripes:


Horizontal Stripe:


Vertical Stripes:


Vertical Stripe:


Diagonal Stripe:


You might notice that the Fourier transform frequency amplitudes actually run perpendicular to the orientation of the stripes. Look for a post soon which makes use of this property (:

Example Code

Here is the code I used to generate the examples above.

If you pass this program the name of a 24 bit bmp file, it will generate and save the DFT, and also the inverse DFT to show that the image can survive a round trip. It will also do a low pass filter, high pass filter, and set the phase of all frequencies to zero, saving off both the frequency amplitude information, as well as the image generated from the frequency information for those operations.

The program below is written for clarity, not speed. In particular, the DFT and IDFT code is naively implemented so is O(N^4). To speed it up, it should be threaded, do the work on each axis separately, and also use a fast Fourier transform implementation.

#define _CRT_SECURE_NO_WARNINGS
 
#include <stdio.h>
#include <stdint.h>
#include <array>
#include <vector>
#include <complex>
#include <windows.h>  // for bitmap headers and performance counter.  Sorry non windows people!

const float c_pi = 3.14159265359f;
const float c_rootTwo = 1.41421356237f;

typedef uint8_t uint8;

struct SProgress
{
    SProgress (const char* message, int total) : m_message(message), m_total(total)
    {
        m_amount = 0;
        m_lastPercent = 0;
        printf("%s   0%%", message);

        QueryPerformanceFrequency(&m_freq);
        QueryPerformanceCounter(&m_start);
    }

    ~SProgress ()
    {
        // make it show 100%
        m_amount = m_total;
        Update(0);

        // show how long it took
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        float seconds = ((float)(end.QuadPart - m_start.QuadPart)) / m_freq.QuadPart;
        printf(" (%0.2f seconds)\n", seconds);
    }

    void Update (int delta = 1)
    {
        m_amount += delta;
        int percent = int(100.0f * float(m_amount) / float(m_total));
        if (percent <= m_lastPercent)
            return;

        m_lastPercent = percent;
        printf("%c%c%c%c", 8, 8, 8, 8);
        if (percent < 100)
            printf(" ");
        if (percent < 10)
            printf(" ");
        printf("%i%%", percent);
    }

    int m_lastPercent;
    int m_amount;
    int m_total;
    const char* m_message;

    LARGE_INTEGER m_start;
    LARGE_INTEGER m_freq;
};
 
struct SImageData
{
    SImageData ()
        : m_width(0)
        , m_height(0)
    { }
 
    long m_width;
    long m_height;
    long m_pitch;
    std::vector<uint8> m_pixels;
};

struct SImageDataComplex
{
    SImageDataComplex ()
        : m_width(0)
        , m_height(0)
    { }

    long m_width;
    long m_height;
    std::vector<std::complex<float>> m_pixels;
};
 
bool LoadImage (const char *fileName, SImageData& imageData)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "rb");
    if (!file)
        return false;
 
    // read the headers if we can
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
    if (fread(&header, sizeof(header), 1, file) != 1 ||
        fread(&infoHeader, sizeof(infoHeader), 1, file) != 1 ||
        header.bfType != 0x4D42 || infoHeader.biBitCount != 24)
    {
        fclose(file);
        return false;
    }
 
    // read in our pixel data if we can. Note that it's in BGR order, and width is padded to the next power of 4
    imageData.m_pixels.resize(infoHeader.biSizeImage);
    fseek(file, header.bfOffBits, SEEK_SET);
    if (fread(&imageData.m_pixels[0], imageData.m_pixels.size(), 1, file) != 1)
    {
        fclose(file);
        return false;
    }
 
    imageData.m_width = infoHeader.biWidth;
    imageData.m_height = infoHeader.biHeight;
 
    imageData.m_pitch = imageData.m_width*3;
    if (imageData.m_pitch & 3)
    {
        imageData.m_pitch &= ~3;
        imageData.m_pitch += 4;
    }
 
    fclose(file);
    return true;
}
 
bool SaveImage (const char *fileName, const SImageData &image)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "wb");
    if (!file) {
        printf("Could not save %s\n", fileName);
        return false;
    }
 
    // make the header info
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
 
    header.bfType = 0x4D42;
    header.bfReserved1 = 0;
    header.bfReserved2 = 0;
    header.bfOffBits = 54;
 
    infoHeader.biSize = 40;
    infoHeader.biWidth = image.m_width;
    infoHeader.biHeight = image.m_height;
    infoHeader.biPlanes = 1;
    infoHeader.biBitCount = 24;
    infoHeader.biCompression = 0;
    infoHeader.biSizeImage = image.m_pixels.size();
    infoHeader.biXPelsPerMeter = 0;
    infoHeader.biYPelsPerMeter = 0;
    infoHeader.biClrUsed = 0;
    infoHeader.biClrImportant = 0;
 
    header.bfSize = infoHeader.biSizeImage + header.bfOffBits;
 
    // write the data and close the file
    fwrite(&header, sizeof(header), 1, file);
    fwrite(&infoHeader, sizeof(infoHeader), 1, file);
    fwrite(&image.m_pixels[0], infoHeader.biSizeImage, 1, file);
    fclose(file);

    printf("%s saved\n", fileName);
    return true;
}
 
void ImageToGrey (const SImageData &srcImage, SImageData &destImage)
{
    destImage = srcImage;

    for (int x = 0; x < srcImage.m_width; ++x)
    {
        for (int y = 0; y < srcImage.m_height; ++y)
        {
            const uint8 *src = &srcImage.m_pixels[(y * srcImage.m_pitch) + x * 3];
            uint8 *dest = &destImage.m_pixels[(y * destImage.m_pitch) + x * 3];

            uint8 grey = uint8((float(src[0]) * 0.3f + float(src[1]) * 0.59f + float(src[2]) * 0.11f));
            dest[0] = grey;
            dest[1] = grey;
            dest[2] = grey;
        }
    }
}

std::complex<float> DFTPixel (const SImageData &srcImage, int K, int L)
{
    std::complex<float> ret(0.0f, 0.0f);

    for (int x = 0; x < srcImage.m_width; ++x)
    {
        for (int y = 0; y < srcImage.m_height; ++y)
        {
            // Get the pixel value (assuming greyscale) and convert it to [0,1] space
            const uint8 *src = &srcImage.m_pixels[(y * srcImage.m_pitch) + x * 3];
            float grey = float(src[0]) / 255.0f;

            // Add to the sum of the return value
            float v = float(K * x) / float(srcImage.m_width);
            v += float(L * y) / float(srcImage.m_height);
            ret += std::complex<float>(grey, 0.0f) * std::polar<float>(1.0f, -2.0f * c_pi * v);
        }
    }

    return ret;
}
 
void DFTImage (const SImageData &srcImage, SImageDataComplex &destImage)
{
    // NOTE: this function assumes srcImage is greyscale, so works on only the red component of srcImage.
    // ImageToGrey() will convert an image to greyscale.

    // size the output dft data
    destImage.m_width = srcImage.m_width;
    destImage.m_height = srcImage.m_height;
    destImage.m_pixels.resize(destImage.m_width*destImage.m_height);

    SProgress progress("DFT:", srcImage.m_width * srcImage.m_height);
 
    // calculate 2d dft (brute force, not using fast fourier transform)
    for (int x = 0; x < srcImage.m_width; ++x)
    {
        for (int y = 0; y < srcImage.m_height; ++y)
        {
            // calculate DFT for that pixel / frequency
            destImage.m_pixels[y * destImage.m_width + x] = DFTPixel(srcImage, x, y);

            // update progress
            progress.Update();
        }
    }
}

uint8 InverseDFTPixel (const SImageDataComplex &srcImage, int K, int L)
{
    std::complex<float> total(0.0f, 0.0f);
    for (int x = 0; x < srcImage.m_width; ++x)
    {
        for (int y = 0; y < srcImage.m_height; ++y)
        {
            // Get the pixel value
            const std::complex<float> &src = srcImage.m_pixels[(y * srcImage.m_width) + x];

            // Add to the sum of the return value
            float v = float(K * x) / float(srcImage.m_width);
            v += float(L * y) / float(srcImage.m_height);
            std::complex<float> result = src * std::polar<float>(1.0f, 2.0f * c_pi * v);

            // sum up the results
            total += result;
        }
    }

    float idft = std::abs(total) / float(srcImage.m_width*srcImage.m_height);

    // make sure the values are in range
    if (idft < 0.0f)
        idft = 0.0f;
    if (idft > 1.0f)
        idft = 1.0;

    return uint8(idft * 255.0f);
}

void InverseDFTImage (const SImageDataComplex &srcImage, SImageData &destImage)
{
    // size the output image
    destImage.m_width = srcImage.m_width;
    destImage.m_height = srcImage.m_height;
    destImage.m_pitch = srcImage.m_width * 3;
    if (destImage.m_pitch & 3)
    {
        destImage.m_pitch &= ~3;
        destImage.m_pitch += 4;
    }
    destImage.m_pixels.resize(destImage.m_pitch*destImage.m_height);

    SProgress progress("Inverse DFT:", srcImage.m_width*srcImage.m_height);

    // calculate inverse 2d dft (brute force, not using fast fourier transform)
    for (int x = 0; x < srcImage.m_width; ++x)
    {
        for (int y = 0; y < srcImage.m_height; ++y)
        {
            // calculate DFT for that pixel / frequency
            uint8 idft = InverseDFTPixel(srcImage, x, y);
            uint8* dest = &destImage.m_pixels[y*destImage.m_pitch + x * 3];
            dest[0] = idft;
            dest[1] = idft;
            dest[2] = idft;

            // update progress
            progress.Update();
        }
    }
}

void GetMagnitudeData (const SImageDataComplex& srcImage, SImageData& destImage)
{
    // size the output image
    destImage.m_width = srcImage.m_width;
    destImage.m_height = srcImage.m_height;
    destImage.m_pitch = srcImage.m_width * 3;
    if (destImage.m_pitch & 3)
    {
        destImage.m_pitch &= ~3;
        destImage.m_pitch += 4;
    }
    destImage.m_pixels.resize(destImage.m_pitch*destImage.m_height);

    // get floating point magnitude data
    std::vector<float> magArray;
    magArray.resize(srcImage.m_width*srcImage.m_height);
    float maxmag = 0.0f;
    for (int x = 0; x < srcImage.m_width; ++x)
    {
        for (int y = 0; y < srcImage.m_height; ++y)
        {
            // Offset the information by half width & height in the positive direction.
            // This makes frequency 0 (DC) be at the image origin, like most diagrams show it.
            int k = (x + srcImage.m_width / 2) % srcImage.m_width;
            int l = (y + srcImage.m_height / 2) % srcImage.m_height;
            const std::complex<float> &src = srcImage.m_pixels[l*srcImage.m_width + k];

            float mag = std::abs(src);
            if (mag > maxmag)
                maxmag = mag;

            magArray[y*srcImage.m_width + x] = mag;
        }
    }
    if (maxmag == 0.0f)
        maxmag = 1.0f;

    const float c = 255.0f / log(1.0f+maxmag);

    // normalize the magnitude data and send it back in [0, 255]
    for (int x = 0; x < srcImage.m_width; ++x)
    {
        for (int y = 0; y < srcImage.m_height; ++y)
        {
            float src = c * log(1.0f + magArray[y*srcImage.m_width + x]);

            uint8 magu8 = uint8(src);

            uint8* dest = &destImage.m_pixels[y*destImage.m_pitch + x * 3];
            dest[0] = magu8;
            dest[1] = magu8;
            dest[2] = magu8;
        }
    }
}

void GetPhaseData (const SImageDataComplex& srcImage, SImageData& destImage)
{
    // size the output image
    destImage.m_width = srcImage.m_width;
    destImage.m_height = srcImage.m_height;
    destImage.m_pitch = srcImage.m_width * 3;
    if (destImage.m_pitch & 3)
    {
        destImage.m_pitch &= ~3;
        destImage.m_pitch += 4;
    }
    destImage.m_pixels.resize(destImage.m_pitch*destImage.m_height);

    // get floating point phase data, and encode it in [0,255]
    for (int x = 0; x < srcImage.m_width; ++x)
    {
        for (int y = 0; y < srcImage.m_height; ++y)
        {
            // Offset the information by half width & height in the positive direction.
            // This makes frequency 0 (DC) be at the image origin, like most diagrams show it.
            int k = (x + srcImage.m_width / 2) % srcImage.m_width;
            int l = (y + srcImage.m_height / 2) % srcImage.m_height;
            const std::complex<float> &src = srcImage.m_pixels[l*srcImage.m_width + k];

            // get phase, and change it from [-pi,+pi] to [0,255]
            float phase = (0.5f + 0.5f * std::atan2(src.real(), src.imag()) / c_pi);
            if (phase < 0.0f)
                phase = 0.0f;
            if (phase > 1.0f)
                phase = 1.0;
            uint8 phase255 = uint8(phase * 255);

            // write the phase as grey scale color
            uint8* dest = &destImage.m_pixels[y*destImage.m_pitch + x * 3];
            dest[0] = phase255;
            dest[1] = phase255;
            dest[2] = phase255;
        }
    }
}

int main (int argc, char **argv)
{
    float scale = 1.0f;
    int filter = 0;
 
    bool showUsage = argc < 2;
    char *srcFileName = argv[1];
 
    if (showUsage)
    {
        printf("Usage: <source>\n\n");
        return 1;
    }

    // trim off file extension from source filename so we can make our other file names
    char baseFileName[1024];
    strcpy(baseFileName, srcFileName);
    for (int i = strlen(baseFileName) - 1; i >= 0; --i)
    {
        if (baseFileName[i] == '.')
        {
            baseFileName[i] = 0;
            break;
        }
    }

    // Load source image if we can
    SImageData srcImage;
    if (LoadImage(srcFileName, srcImage))
    {
        printf("%s loaded (%i x %i)\n", srcFileName, srcImage.m_width, srcImage.m_height);

        // do DFT on a greyscale version of the image, instead of doing it per color channel
        SImageData greyImage;
        ImageToGrey(srcImage, greyImage);
        SImageDataComplex frequencyData;
        DFTImage(greyImage, frequencyData);

        // save magnitude information
        {
            char outFileName[1024];
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".raw.mag.bmp");

            SImageData destImage;
            GetMagnitudeData(frequencyData, destImage);
            SaveImage(outFileName, destImage);
        }

        // save phase information
        {
            char outFileName[1024];
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".raw.phase.bmp");

            SImageData destImage;
            GetPhaseData(frequencyData, destImage);
            SaveImage(outFileName, destImage);
        }

        // inverse dft the modified frequency and save the result
        {
            char outFileName[1024];
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".raw.idft.bmp");

            SImageData modifiedImage;
            InverseDFTImage(frequencyData, modifiedImage);
            SaveImage(outFileName, modifiedImage);
        }

        // Low Pass Filter: Remove high frequencies, write out frequency magnitudes, write out inverse dft
        {
            printf("\n=====LPF=====\n");

            // remove frequencies that are too far from frequency 0.
            // Note that even though our output frequency images have frequency 0 (DC) in the center, that
            // isn't actually how it's stored in our SImageDataComplex structure.  Pixel (0,0) is frequency 0.
            SImageDataComplex dft = frequencyData;
            float halfWidth = float(dft.m_width / 2);
            float halfHeight = float(dft.m_height / 2);
            for (int x = 0; x < dft.m_width; ++x)
            {
                for (int y = 0; y < dft.m_height; ++y)
                {
                    float relX = 0.0f;
                    float relY = 0.0f;

                    if (x < halfWidth)
                        relX = float(x) / halfWidth;
                    else
                        relX = (float(x) - float(dft.m_width)) / halfWidth;

                    if (y < halfHeight)
                        relY = float(y) / halfHeight;
                    else
                        relY = (float(y) - float(dft.m_height)) / halfHeight;

                    float dist = sqrt(relX*relX + relY*relY) / c_rootTwo; // divided by root 2 so our distance is from 0 to 1
                    if (dist > 0.1f)
                        dft.m_pixels[y*dft.m_width + x] = std::complex<float>(0.0f, 0.0f);
                }
            }

            // write dft magnitude data
            char outFileName[1024];
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".lpf.mag.bmp");
            SImageData destImage;
            GetMagnitudeData(dft, destImage);
            SaveImage(outFileName, destImage);

            // inverse dft and save the image
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".lpf.idft.bmp");
            SImageData modifiedImage;
            InverseDFTImage(dft, modifiedImage);
            SaveImage(outFileName, modifiedImage);
        }

        // High Pass Filter: Remove low frequencies, write out frequency magnitudes, write out inverse dft
        {
            printf("\n=====HPF=====\n");

            // remove frequencies that are too close to frequency 0.
            // Note that even though our output frequency images have frequency 0 (DC) in the center, that
            // isn't actually how it's stored in our SImageDataComplex structure.  Pixel (0,0) is frequency 0.
            SImageDataComplex dft = frequencyData;
            float halfWidth = float(dft.m_width / 2);
            float halfHeight = float(dft.m_height / 2);
            for (int x = 0; x < dft.m_width; ++x)
            {
                for (int y = 0; y < dft.m_height; ++y)
                {
                    float relX = 0.0f;
                    float relY = 0.0f;

                    if (x < halfWidth)
                        relX = float(x) / halfWidth;
                    else
                        relX = (float(x) - float(dft.m_width)) / halfWidth;

                    if (y < halfHeight)
                        relY = float(y) / halfHeight;
                    else
                        relY = (float(y) - float(dft.m_height)) / halfHeight;

                    float dist = sqrt(relX*relX + relY*relY) / c_rootTwo; // divided by root 2 so our distance is from 0 to 1
                    if (dist < 0.1f)
                        dft.m_pixels[y*dft.m_width + x] = std::complex<float>(0.0f, 0.0f);
                }
            }

            // write dft magnitude data
            char outFileName[1024];
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".hpf.mag.bmp");
            SImageData destImage;
            GetMagnitudeData(dft, destImage);
            SaveImage(outFileName, destImage);

            // inverse dft and save the image
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".hpf.idft.bmp");
            SImageData modifiedImage;
            InverseDFTImage(dft, modifiedImage);
            SaveImage(outFileName, modifiedImage);
        }

        // ZeroPhase
        {
            printf("\n=====Zero Phase=====\n");

            // Set phase to zero for all frequencies.
            // Note that even though our output frequency images have frequency 0 (DC) in the center, that
            // isn't actually how it's stored in our SImageDataComplex structure.  Pixel (0,0) is frequency 0.
            SImageDataComplex dft = frequencyData;
            float halfWidth = float(dft.m_width / 2);
            float halfHeight = float(dft.m_height / 2);
            for (int x = 0; x < dft.m_width; ++x)
            {
                for (int y = 0; y < dft.m_height; ++y)
                {
                    std::complex<float>& v = dft.m_pixels[y*dft.m_width + x];
                    float mag = std::abs(v);
                    v = std::complex<float>(mag, 0.0f);
                }
            }

            // write dft magnitude data
            char outFileName[1024];
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".phase0.mag.bmp");
            SImageData destImage;
            GetMagnitudeData(dft, destImage);
            SaveImage(outFileName, destImage);

            // inverse dft and save the image
            strcpy(outFileName, baseFileName);
            strcat(outFileName, ".phase0.idft.bmp");
            SImageData modifiedImage;
            InverseDFTImage(dft, modifiedImage);
            SaveImage(outFileName, modifiedImage);
        }
    }
    else
        printf("could not read 24 bit bmp file %s\n\n", srcFileName);

    return 0;
}

Links

Here are some links that I found useful:
Fourier Transform
http://www.thefouriertransform.com/
Introduction To Fourier Transforms For Image Processing

Comments

comments

Intro To Audio Synthesis For Music Presentation

Today I gave a presentation at work on the basics of audio synthesis for music. It seemed to go fairly well and I was surprised to hear that so many others also dabbled in audio synth and music.

The slide deck and example C++ program (uses portaudio) are both up on github here:
https://github.com/Atrix256/MusicSynth

Questions, feedback, etc? Drop me a line (:

Comments

comments

The Beating Effect

This post is going to be a pretty short one, so here it is 😛

The Beating Effect

The beating effect occurs when you play two similar frequencies of sound at the same time.

Because playing two sounds at once means adding them together, and due to the fact that sound waves are made up of positive and negative values (aka positive and negative pressures), the sounds playing at different frequencies will sometimes add peaks and troughs together to get louder, and other times the peak of one wave will add to the valley of another wave and the result will be quieter. This is know as constructive and destructive interference respectively.

The end result is that the sound will have a pulsing quality to it, like a tremolo effect. If one sound is played at frequency F1 and the other sound is played at frequency F2, the pulsing sound will occur at 2*(F2-F1) times a second.

Here’s a demo of this in action where the first sound is 200hz, the second sound is 205hz, and the result of them played together has a 10hz tremolo effect!

monobeat.wav

Binaural Beat

The beating effect happens when sound waves physically mix together.

Believe it or not though, there is a part of your brain where it mixes (adds) the sounds from each ear together as well.

That means that if you play similar frequencies to different ears, they will mix INSIDE YOUR BRAIN, and you will perceive a different kind of beating effect.

Below is a demo of that. You might need a pair of headphones to get the full effect.

stereobeat.wav

If you think this is pretty neat, you might want to google “psycho-acoustics” for more stuff like this (:

Source Code

Here’s the C++ code that generated these sound files.

#include <stdio.h>
#include <memory.h>
#include <inttypes.h>
#include <vector>

// constants
const float c_pi = 3.14159265359f;
const float c_twoPi = 2.0f * c_pi;

// typedefs
typedef uint16_t    uint16;
typedef uint32_t    uint32;
typedef int16_t     int16;
typedef int32_t     int32;

//this struct is the minimal required header data for a wav file
struct SMinimalWaveFileHeader
{
    //the main chunk
    unsigned char m_chunkID[4];
    uint32		  m_chunkSize;
    unsigned char m_format[4];

    //sub chunk 1 "fmt "
    unsigned char m_subChunk1ID[4];
    uint32		  m_subChunk1Size;
    uint16		  m_audioFormat;
    uint16		  m_numChannels;
    uint32		  m_sampleRate;
    uint32		  m_byteRate;
    uint16		  m_blockAlign;
    uint16		  m_bitsPerSample;

    //sub chunk 2 "data"
    unsigned char m_subChunk2ID[4];
    uint32		  m_subChunk2Size;

    //then comes the data!
};

//this writes
template <typename T>
bool WriteWaveFile(const char *fileName, std::vector<T> data, int16 numChannels, int32 sampleRate)
{
    int32 dataSize = data.size() * sizeof(T);
    int32 bitsPerSample = sizeof(T) * 8;

    //open the file if we can
    FILE *File = nullptr;
    fopen_s(&File, fileName, "w+b");
    if (!File)
        return false;

    SMinimalWaveFileHeader waveHeader;

    //fill out the main chunk
    memcpy(waveHeader.m_chunkID, "RIFF", 4);
    waveHeader.m_chunkSize = dataSize + 36;
    memcpy(waveHeader.m_format, "WAVE", 4);

    //fill out sub chunk 1 "fmt "
    memcpy(waveHeader.m_subChunk1ID, "fmt ", 4);
    waveHeader.m_subChunk1Size = 16;
    waveHeader.m_audioFormat = 1;
    waveHeader.m_numChannels = numChannels;
    waveHeader.m_sampleRate = sampleRate;
    waveHeader.m_byteRate = sampleRate * numChannels * bitsPerSample / 8;
    waveHeader.m_blockAlign = numChannels * bitsPerSample / 8;
    waveHeader.m_bitsPerSample = bitsPerSample;

    //fill out sub chunk 2 "data"
    memcpy(waveHeader.m_subChunk2ID, "data", 4);
    waveHeader.m_subChunk2Size = dataSize;

    //write the header
    fwrite(&waveHeader, sizeof(SMinimalWaveFileHeader), 1, File);

    //write the wave data itself
    fwrite(&data[0], dataSize, 1, File);

    //close the file and return success
    fclose(File);
    return true;
}

template <typename T>
void ConvertFloatSamples (const std::vector<float>& in, std::vector<T>& out)
{
    // make our out samples the right size
    out.resize(in.size());

    // convert in format to out format !
    for (size_t i = 0, c = in.size(); i < c; ++i)
    {
        float v = in[i];
        if (v < 0.0f)
            v *= -float(std::numeric_limits<T>::lowest());
        else
            v *= float(std::numeric_limits<T>::max());
        out[i] = T(v);
    }
}

void GenerateMonoBeatingSamples (std::vector<float>& samples, int sampleRate)
{
    int sectionLength = samples.size() / 3;
    int envelopeLen = sampleRate / 20;

    for (int index = 0, numSamples = samples.size(); index < numSamples; ++index)
    {
        samples[index] = 0.0f;
        int section = index / sectionLength;
        int sectionOffset = index % sectionLength;

        // apply an envelope at front and back  of each section keep it from popping between sounds
        float envelope = 1.0f;
        if (sectionOffset < envelopeLen)
            envelope = float(sectionOffset) / float(envelopeLen);
        else if (sectionOffset > sectionLength - envelopeLen)
            envelope = (float(sectionLength) - float(sectionOffset)) / float(envelopeLen);

        // first sound
        if (section == 0 || section == 2)
            samples[index] += sin(float(index) * c_twoPi * 200.0f / float(sampleRate)) * envelope;

        // second sound
        if (section == 1 || section == 2)
            samples[index] += sin(float(index) * c_twoPi * 205.0f / float(sampleRate)) * envelope;

        // scale it to prevent clipping
        if (section == 2)
            samples[index] *= 0.5f;
        samples[index] *= 0.95f;
    }
}

void GenerateStereoBeatingSamples (std::vector<float>& samples, int sampleRate)
{
    int sectionLength = (samples.size() / 2) / 3;
    int envelopeLen = sampleRate / 20;

    for (int index = 0, numSamples = samples.size() / 2; index < numSamples; ++index)
    {
        samples[index * 2] = 0.0f;
        samples[index * 2 + 1] = 0.0f;

        int section = index / sectionLength;
        int sectionOffset = index % sectionLength;

        // apply an envelope at front and back  of each section keep it from popping between sounds
        float envelope = 1.0f;
        if (sectionOffset < envelopeLen)
            envelope = float(sectionOffset) / float(envelopeLen);
        else if (sectionOffset > sectionLength - envelopeLen)
            envelope = (float(sectionLength) - float(sectionOffset)) / float(envelopeLen);
        envelope *= 0.95f;

        // first sound
        if (section == 0 || section == 2)
            samples[index * 2] += sin(float(index) * c_twoPi * 200.0f / float(sampleRate)) * envelope;

        // second sound
        if (section == 1 || section == 2)
            samples[index * 2 + 1] += sin(float(index) * c_twoPi * 205.0f / float(sampleRate)) * envelope;
    }
}

//the entry point of our application
int main(int argc, char **argv)
{
    // generate the mono beating effect
    {
        // sound format parameters
        const int c_sampleRate = 44100;
        const int c_numSeconds = 4;
        const int c_numChannels = 1;
        const int c_numSamples = c_sampleRate * c_numChannels * c_numSeconds;

        // make space for our samples
        std::vector<float> samples;
        samples.resize(c_numSamples);

        // generate samples
        GenerateMonoBeatingSamples(samples, c_sampleRate);

        // convert from float to the final format
        std::vector<int32> samplesInt;
        ConvertFloatSamples(samples, samplesInt);

        // write our samples to a wave file
        WriteWaveFile("monobeat.wav", samplesInt, c_numChannels, c_sampleRate);
    }

    // generate the stereo beating effect (binaural beat)
    {
        // sound format parameters
        const int c_sampleRate = 44100;
        const int c_numSeconds = 4;
        const int c_numChannels = 2;
        const int c_numSamples = c_sampleRate * c_numChannels * c_numSeconds;

        // make space for our samples
        std::vector<float> samples;
        samples.resize(c_numSamples);

        // generate samples
        GenerateStereoBeatingSamples(samples, c_sampleRate);

        // convert from float to the final format
        std::vector<int32> samplesInt;
        ConvertFloatSamples(samples, samplesInt);

        // write our samples to a wave file
        WriteWaveFile("stereobeat.wav", samplesInt, c_numChannels, c_sampleRate);
    }
}

Links

For a more mathematical explanation of these, check out wikipedia (:
Wikipedia: Beat (acoustics)

Wikipedia: Binaural beats

Comments

comments

Synthesizing a Plucked String Sound With the Karplus-Strong Algorithm

If you are looking to synthesize the sound of a plucked string, there is an amazingly simple algorithm for doing so called the Karplus-Strong Algorithm.

Give it a listen: KarplusStrong.wav
Here it is with flange and reverb effects applied: KPFlangeReverb.wav

It works like this:

  1. Fill a circular buffer with static (random numbers)
  2. Play the contents of the circular buffer over and over
  3. Each time you play a sample, replace that sample with the average of itself and the next sample in the buffer. Also multiplying that average by a feedback value (like say, 0.996)

Amazingly, that is all there is to it!

Why Does That Work?!

The reason this works is that it is actually very similar to how a real guitar string pluck works.

When you pluck a guitar string, if you had a perfect pluck at the perfect location with the perfect transfer of energy, you’d get a note that was “perfect”. It wouldn’t be a pure sine wave since strings have harmonics (integer multiple frequencies) beyond their basic tuning, but it would be a pure note.

In reality, that isn’t what happens, so immediately after plucking the string, there is a lot of vibrations in there that “don’t belong” due to the imperfect pluck. Since the string is tuned, it wants to be vibrating a specific way, so over time the vibrations evolve from the imperfect pluck vibrations to the tuning of the guitar string. As you average the samples together, you are removing the higher frequency noise/imperfections. Averaging is a crude low pass filter. This makes it converge to the right frequencies over time.

It’s also important to note that with a real stringed instrument, when you play a note, the high frequencies disappear before the low frequencies. This averaging / low pass filter makes that happen as well and is part of what helps it sound so realistic.

Also while all that is going on, the energy in the string is being diminished as it becomes heat and sound waves and such, so the noise gets quieter over time. When you multiply the values by a feedback value which is less than 1, you are simulating this loss of energy by making the values get smaller over time.

Tuning The Note

This wasn’t intuitive for me at first, but the frequency that the note plays at is determined ENTIRELY by the size of the circular buffer.

If your audio has a sample rate of 44100hz (44100 samples played a second), and you use this algorithm with a buffer size of 200 samples, that means that the note synthesized will be 220.5hz. This is because 44100/200 = 220.5.

Thinking about the math from another direction, we can figure out what our buffer size needs to be for a specific frequency. If our sample rate is 44100hz and we want to play a note at 440hz, that means we need a buffer size of 100.23 samples. This is because 44100/440 = 100.23. Since we can’t have a fractional number of samples, we can just round to 100.

You can actually deal with the fractional buffer size by stepping through the ring buffer in non integer steps and using the fraction to interpolate audio samples, but I’ll leave that as an exercise for you if you want that perfectly tuned note. IMO leaving it slightly off could actually be a good thing. What guitar is ever perfectly in tune, right?! With it being slightly out of tune, it’s more likely to make more realistic sounds and sound interactions when paired with other instruments.

You are probably wondering like I was, why the buffer size affects the frequency of the note. The reason for this is actually pretty simple and intuitive after all.

The reason is because the definition of frequency is just how many times a wave form repeats per second. The wave form could be a sine wave, a square wave, a triangle wave, or it could be something more complex, but frequency is always the number of repetitions per second. If you think about our ring buffer as being a wave form, you can now see that if we have a buffer size of 200 samples, and a sample rate of 44100hz, when we play that buffer continually, it’s going to play back 220.5 times every second, which means it will play with a frequency of 220.5!

Sure, we modify the buffer (and waveform) as we play it, but the modifications are small, so the waveform is similar from play to play.

Some More Details

I’ve found that this algorithm doesn’t work as well with low frequency notes as it does with high frequency notes.

They say you can prime the buffer with a saw tooth wave (or other wave forms) instead of static (noise). While it still “kind of works”, in my experimentation, it didn’t work out that well.

You could try using other low pass filters to see if that affects the quality of the note generated. The simple averaging method works so well, I didn’t explore alternative options very much.

Kmm on hacker news commented that averaging the current sample with the last and next, instead of just the next had the benefit that the wave form didn’t move forward half a step each play through and that there is an audible difference between the techniques. I gave it a try and sure enough, there is an audible difference, the sound is less harsh on the ears. I believe this is so because averaging 3 samples instead of 2 is a stronger low pass filter, so gets rid of higher frequencies faster.

Example Code

Here is the C++ code that generated the sample at the top of the post. Now that you can generate plucked string sounds, you can add some distortion, flange, reverb, etc and make some sweet (synthesized) metal without having to learn to play guitar and build up finger calluses 😛

#include <stdio.h>
#include <memory.h>
#include <inttypes.h>
#include <vector>

// constants
const float c_pi = 3.14159265359f;
const float c_twoPi = 2.0f * c_pi;

// typedefs
typedef uint16_t    uint16;
typedef uint32_t    uint32;
typedef int16_t     int16;
typedef int32_t     int32;

//this struct is the minimal required header data for a wav file
struct SMinimalWaveFileHeader
{
    //the main chunk
    unsigned char m_chunkID[4];
    uint32		  m_chunkSize;
    unsigned char m_format[4];

    //sub chunk 1 "fmt "
    unsigned char m_subChunk1ID[4];
    uint32		  m_subChunk1Size;
    uint16		  m_audioFormat;
    uint16		  m_numChannels;
    uint32		  m_sampleRate;
    uint32		  m_byteRate;
    uint16		  m_blockAlign;
    uint16		  m_bitsPerSample;

    //sub chunk 2 "data"
    unsigned char m_subChunk2ID[4];
    uint32		  m_subChunk2Size;

    //then comes the data!
};

//this writes
template <typename T>
bool WriteWaveFile(const char *fileName, std::vector<T> data, int16 numChannels, int32 sampleRate)
{
    int32 dataSize = data.size() * sizeof(T);
    int32 bitsPerSample = sizeof(T) * 8;

    //open the file if we can
    FILE *File = nullptr;
    fopen_s(&File, fileName, "w+b");
    if (!File)
        return false;

    SMinimalWaveFileHeader waveHeader;

    //fill out the main chunk
    memcpy(waveHeader.m_chunkID, "RIFF", 4);
    waveHeader.m_chunkSize = dataSize + 36;
    memcpy(waveHeader.m_format, "WAVE", 4);

    //fill out sub chunk 1 "fmt "
    memcpy(waveHeader.m_subChunk1ID, "fmt ", 4);
    waveHeader.m_subChunk1Size = 16;
    waveHeader.m_audioFormat = 1;
    waveHeader.m_numChannels = numChannels;
    waveHeader.m_sampleRate = sampleRate;
    waveHeader.m_byteRate = sampleRate * numChannels * bitsPerSample / 8;
    waveHeader.m_blockAlign = numChannels * bitsPerSample / 8;
    waveHeader.m_bitsPerSample = bitsPerSample;

    //fill out sub chunk 2 "data"
    memcpy(waveHeader.m_subChunk2ID, "data", 4);
    waveHeader.m_subChunk2Size = dataSize;

    //write the header
    fwrite(&waveHeader, sizeof(SMinimalWaveFileHeader), 1, File);

    //write the wave data itself
    fwrite(&data[0], dataSize, 1, File);

    //close the file and return success
    fclose(File);
    return true;
}

template <typename T>
void ConvertFloatSamples (const std::vector<float>& in, std::vector<T>& out)
{
    // make our out samples the right size
    out.resize(in.size());

    // convert in format to out format !
    for (size_t i = 0, c = in.size(); i < c; ++i)
    {
        float v = in[i];
        if (v < 0.0f)
            v *= -float(std::numeric_limits<T>::lowest());
        else
            v *= float(std::numeric_limits<T>::max());
        out[i] = T(v);
    }
}
//calculate the frequency of the specified note.
//fractional notes allowed!
float CalcFrequency(float octave, float note)
/*
	Calculate the frequency of any note!
	frequency = 440×(2^(n/12))

	N=0 is A4
	N=1 is A#4
	etc...

	notes go like so...
	0  = A
	1  = A#
	2  = B
	3  = C
	4  = C#
	5  = D
	6  = D#
	7  = E
	8  = F
	9  = F#
	10 = G
	11 = G#
*/
{
    return (float)(440 * pow(2.0, ((double)((octave - 4) * 12 + note)) / 12.0));
}

class CKarplusStrongStringPluck
{
public:
    CKarplusStrongStringPluck (float frequency, float sampleRate, float feedback)
    {
        m_buffer.resize(uint32(float(sampleRate) / frequency));
        for (size_t i = 0, c = m_buffer.size(); i < c; ++i) {
            m_buffer[i] = ((float)rand()) / ((float)RAND_MAX) * 2.0f - 1.0f;  // noise
            //m_buffer[i] = float(i) / float(c); // saw wave
        }
        m_index = 0;
        m_feedback = feedback;
    }

    float GenerateSample ()
    {
        // get our sample to return
        float ret = m_buffer[m_index];

        // low pass filter (average) some samples
        float value = (m_buffer[m_index] + m_buffer[(m_index + 1) % m_buffer.size()]) * 0.5f * m_feedback;
        m_buffer[m_index] = value;

        // move to the next sample
        m_index = (m_index + 1) % m_buffer.size();

        // return the sample from the buffer
        return ret;
    }

private:
    std::vector<float>  m_buffer;
    size_t              m_index;
    float               m_feedback;
};

void GenerateSamples (std::vector<float>& samples, int sampleRate)
{
    std::vector<CKarplusStrongStringPluck> notes;

    enum ESongMode {
        e_twinkleTwinkle,
        e_strum
    };

    int timeBegin = 0;
    ESongMode mode = e_twinkleTwinkle;
    for (int index = 0, numSamples = samples.size(); index < numSamples; ++index)
    {
        switch (mode) {
            case e_twinkleTwinkle: {
                const int c_noteTime = sampleRate / 2;
                int time = index - timeBegin;
                // if we should start a new note
                if (time % c_noteTime == 0) {
                    int note = time / c_noteTime;
                    switch (note) {
                        case 0:
                        case 1: {
                            notes.push_back(CKarplusStrongStringPluck(CalcFrequency(3, 0), float(sampleRate), 0.996f));
                            break;
                        }
                        case 2:
                        case 3: {
                            notes.push_back(CKarplusStrongStringPluck(CalcFrequency(3, 7), float(sampleRate), 0.996f));
                            break;
                        }
                        case 4:
                        case 5: {
                            notes.push_back(CKarplusStrongStringPluck(CalcFrequency(3, 9), float(sampleRate), 0.996f));
                            break;
                        }
                        case 6: {
                            notes.push_back(CKarplusStrongStringPluck(CalcFrequency(3, 7), float(sampleRate), 0.996f));
                            break;
                        }
                        case 7: {
                            mode = e_strum;
                            timeBegin = index+1;
                            break;
                        }
                    }
                }
                break;
            }
            case e_strum: {
                const int c_noteTime = sampleRate / 32;
                int time = index - timeBegin - sampleRate;
                // if we should start a new note
                if (time % c_noteTime == 0) {
                    int note = time / c_noteTime;
                    switch (note) {
                        case 0: notes.push_back(CKarplusStrongStringPluck(55.0f, float(sampleRate), 0.996f)); break;
                        case 1: notes.push_back(CKarplusStrongStringPluck(55.0f + 110.0f, float(sampleRate), 0.996f)); break;
                        case 2: notes.push_back(CKarplusStrongStringPluck(55.0f + 220.0f, float(sampleRate), 0.996f)); break;
                        case 3: notes.push_back(CKarplusStrongStringPluck(55.0f + 330.0f, float(sampleRate), 0.996f)); break;
                        case 4: mode = e_strum; timeBegin = index + 1; break;
                    }
                }
                break;
            }
        }

        // generate and mix our samples from our notes
        samples[index] = 0;
        for (CKarplusStrongStringPluck& note : notes)
            samples[index] += note.GenerateSample();

        // to keep from clipping
        samples[index] *= 0.5f;
    }
}

//the entry point of our application
int main(int argc, char **argv)
{
    // sound format parameters
    const int c_sampleRate = 44100;
    const int c_numSeconds = 9;
    const int c_numChannels = 1;
    const int c_numSamples = c_sampleRate * c_numChannels * c_numSeconds;

    // make space for our samples
    std::vector<float> samples;
    samples.resize(c_numSamples);

    // generate samples
    GenerateSamples(samples, c_sampleRate);

    // convert from float to the final format
    std::vector<int32> samplesInt;
    ConvertFloatSamples(samples, samplesInt);

    // write our samples to a wave file
    WriteWaveFile("out.wav", samplesInt, c_numChannels, c_sampleRate);
}

Links

Hacker News Discussion (This got up to topic #7, woo!)

Wikipedia: Karplus-Strong String Synthesis

Princeton COS 126: Plucking a Guitar String

Shadertoy: Karplus-Strong Variation (Audio) – I tried to make a bufferless Karplus-Strong implementation on shadertoy. It didn’t quite work out but is still a bit interesting.

Comments

comments

Who Cares About Dynamic Array Growth Strategies?

Let’s say that you’ve allocated an array of 20 integers and have used them all. Now it’s time to allocate more, but you aren’t quite sure how many integers you will need in the end. What do you do?

Realloc is probably what you think of first for solving this problem, but let’s ignore that option for the moment. (If you haven’t used realloc before, give this a read! Alloca and Realloc – Useful Tools, Not Ancient Relics)

Without realloc you are left with allocating a new buffer of memory, copying the old buffer to the new buffer, and then freeing the old buffer.

The question remains though, how much memory should you allocate for this new, larger buffer?

You could double your current buffer size whenever you ran out of space. This would mean that as the buffer grew over time, you would do fewer allocations but would have more wasted (allocated but unused) memory.

You could also go the other way and just add 10 more int’s every time you ran out of space. This would mean that you would do a larger number of allocations (more CPU usage, possibly more fragmentation), but you’d end up with less wasted space.

Either way, it obviously depends entirely on usage patterns and it’s all subjective and situational.

…Or is it?

A Surprising Reason For Caring

Believe it or not, growth strategies can make a huge difference. Below we will explore the difference between the seemingly arbitrary rules of making a buffer twice as big, or 1.5 times as big.

Let’s say that we have a bunch of free memory starting at address 0. Let’s analyze what happens as we resize arrays in each scenario.

2x Buffer Size

First let’s see what happens when we double a buffer’s size when it gets full.

We start by allocating 16 bytes. The allocator gives us address 0 for our pointer.

When the buffer gets full, we allocate 32 bytes (at address 16), copy the 16 bytes into it and then free our first 16 byte buffer.

When that buffer gets full, we allocate 64 bytes (at address 48), copy the 32 bytes into it and then free our 32 byte buffer.

Lastly, that buffer gets full, so we allocate 128 bytes (at address 112), copy the 64 bytes into it and then free our 64 byte buffer.

As you can see, doubling the buffer size causes our pointer to keep moving further down in address space, and a free piece of memory before it will never be large enough to hold a future allocation.

1.5x Buffer Size

Let’s see what happens when we make a buffer 1.5x as large when it gets full.

We start by allocating 16 bytes. The allocator gives us address 0 for our pointer.

When the buffer gets full, we allocate 24 bytes (at address 16), copy the 16 bytes into it and then free our first 16 byte buffer.

When that buffer gets full, we allocate 36 bytes (at address 40), copy the 24 bytes into it and free the 24 byte buffer.

When that buffer gets full, we allocate 54 bytes (at address 76), copy the 36 bytes into it and free the 36 byte buffer.

When that buffer gets full, we allocate 81 bytes (at address 130), copy the 54 bytes into it and free the 54 byte buffer.

Lastly, when that buffer gets full, we need to allocate 122 bytes (we rounded it up). In this case, there is 130 bytes of unused memory starting at address 0, so we can just allocate 122 of those bytes, copy our 81 bytes into it and free the 81 byte buffer.

Our allocations have folded back into themselves. Our pattern of resizing hasn’t created an ever moving / ever growing memory fragmentation monster, unlike the buffer size doubling, which has!

Small Print

The above does decrease memory fragmentation, by encouraging an allocation to tend to stay in one spot in memory, but it comes at a cost. That cost is that since it’s allocating less extra memory when it runs out, that you will end up having to do more allocations to reach the same level of memory usage.

That might be a benefit though, depending on your specific needs. Another way of looking at that is that you will end up with fewer bytes of wasted memory. By wasted memory I mean allocated bytes which are not actually used to store anything.

Realloc Makes This Moot Right?

You may be thinking “well if I use realloc, I don’t need to care about this right?”

That isn’t exactly true. If realloc is unable to give you more memory at the current pointer location, it will allocate a new buffer, copy the old data to the new buffer, free the old buffer, and return you the pointer to the new buffer. This is exactly the case that happens when you don’t use realloc.

Using the above growth strategy with realloc makes realloc work even better. It’s a good thing!

Caveat: exotic allocator behavior may not actually benefit from using this strategy with realloc, so have a look for yourself if you are in doubt!

Links

Here’s a discussion on the topic:
What is the ideal growth rate for a dynamically allocated array?

From the link above, apparently the ideal factor to use when upsizing a buffer in general (when worrying about fragmentation like this), is the golden ratio 1.618. Weird, huh?

Have other information about why various growth factors matter? Leave a comment or drop me a line (:

Thanks to Tom for mentioning this concept. Pretty interesting and surprising IMO.

Comments

comments

Shamir’s Quest: Collect Any 3 Keys To Unlock The Secret!

This post is on something called Shamir’s Secret Sharing. It’s a technique where you can break a secret number up into M different pieces, where if you have any N of those M pieces, you are able to figure out the secret.

Thinking of it in video game terms, imagine there are 10 keys hidden in a level, but you can escape the level whenever you find any 7 of them. This is what Shamir’s Secret Sharing enables you to set up cryptographically.

Interestingly in this case, the term sharing in “secret sharing” doesn’t mean sharing the secret with others. It means breaking the secret up into pieces, or SHARES. Secret sharing means that you make shares out of a secret, such that if you have enough of the shares, you can recover the secret.

How Do You Share (Split) The Secret?

The basic idea of how it works is actually really simple. This is good for us trying to learn the technique, but also good to show it’s security since there are so few moving parts.

It relies on something called the Unisolvence Theorem which is a fancy label meaning these things:

  • If you have a linear equation, it takes two (x,y) points to uniquely identify that line. No matter how you write a linear equation, if it passes through those same two points, it’s mathematically equivelant.
  • If you have a quadratic equation, it takes three (x,y) points to uniquely identify that quadratic curve. Again, no matter how you write a quadratic equation, if it passes through those same three points, it’s mathematically equivalent.
  • The pattern continues for equations of any degree. Cubic equations require four points to be uniquely identified, Quartic equations require five points, and so on.

At a high level, how this technique works is that the number of shares (keys) you want someone to collect (N ) defines the degree of an equation.

You use random numbers as the coefficients of the powers of x in that equation, but use your secret number as the constant term.

You then create M data points of the form (x,y) aka (x,f(x)) . Those are your shares. You then give individual shares to people, or go hide them in your dungeon or do whatever you are going to do with them.

As soon as any one person has N of those M shares (data points), they will be able to figure out the equation of the curve and thus get the secret.

The secret number is the constant term of the polynomial, which is also just f(0) .

This image below from wikipedia is great for seeing how you may have two points of a cubic curve, but without a third point you can’t be sure what the quadratic equation is. In fact, there are an infinite number of quadratic curves that pass through any two points! Because of that, it takes the full number of required shares for you to be able to unlock the secret.

Example: Sharing (Splitting) The Secret

First you decide how many shares you want it to take to unlock the secret. This determines the degree of your equation.

Let’s say you wanted a person to have to have four shares to unlock the secret. This means our equation will be a cubic equation, since it takes four points to uniquely define a cubic equation.

Our equation is:

f(x) = R_1x^3 + R_2x^2 + R_3x + S

Where the R_i values are random numbers, and S is the secret value.

Let’s say that our secret value is 435, and that we picked some random numbers for the equation, making the below:

f(x) = 28x^3 + 64x^2 + 9x + 435

We now have a function that is uniquely identifiable by any 4 points of data on it’s curve.

Next we decide how many pieces we are going to create total. We need at least 4 so that it is in fact solvable. Let’s make 6 shares.

To do this, you just plug in 6 different values of x and pair each x value with it’s y value. Let’s do that:

\begin{array}{c|c}  x & f(x) \\  \hline  1 & 536 \\  2 & 933 \\  3 & 1794 \\  4 & 3287 \\  5 & 5580 \\  6 & 8841 \\  \end{array}

When doing this part, remember that the secret number is f(0) , so make sure and not share what the value of the function is when x is 0!

You could then distribute the shares (data pairs) as you saw fit. Maybe some people are more important, so you give them more than one share, requiring a smaller amount of cooperation with them to unlock the secret.

Share distribution details are totally up to you, but we now have our shares, whereby if you have any of the 4 of the 6 total shares, you can unlock the secret.

How Do You Join The Secret?

Once you have the right number of shares and you know the degree of the polynomial (pre-shared “public” information), unlocking the secret is a pretty straightforward process too. To unlock the secret, you just need to use ANY method available for creating an equation of the correct degree from a set of data points.

This can be one of several different interpolation techniques, but the most common one to use seems to be Lagrange interpolation, which is something I previously wrote up that you can read about here: Lagrange Interpolation.

Once you have the equation, you can either evaluate f(0) , or you can write the equation in polynomial form and the constant term will be the secret value.

Example: Joining the Secret

Let’s say that we have these four shares and are ready to get the cubic function and then unlock the secret number:

\begin{array}{c|c}  x & y \\  \hline  1 & 536 \\  2 & 933 \\  4 & 3287 \\  6 & 8841 \\  \end{array}

We could bust out some Lagrange interpolation and figure this out, but let’s be lazy… err efficient I mean. Wolfram alpha can do this for us!

Wolfram Alpha: cubic fit (1, 536), (2, 933), (4, 3287), (6, 8841)

That gives us this equation, saying that it is a perfect fit (which it is!)
28x^3 + 64x^2 + 9x + 435

You can see that our constant term (and f(0) ) is the correct secret value of 435.

Daaaayummm Bru… that is lit AF! We just got hacked by wolfram alpha 😛

A Small Complication

Unfortunately, the above has a weakness. The weakness is that each share you get gives you a little bit more information about the secret value. You can read more about this in the links section at the end if you want to know more details.

Ideally, you wouldn’t have any information about the secret value until you had the full number of shares required to unlock the secret.

To address this problem, we are going to choose some prime number k and instead of shares being (x,y) data points on the curve, they are going to be (x,y \bmod k) . In technical terms we are going to be using points on a finite field, or a Galois field.

The value we choose for k needs to be larger than any of the coefficients of our terms (the random numbers) as well as larger than our secret value and larger than the number of shares we want to create. The larger the better besides that, because a larger k value means a larger “brute force” space to search.

If you want to use this technique in a situation which has real needs for security, please make sure and read more on this technique from more authoritative sources. I’m glossing over the details of security quite a bit, and just trying to give an intuitive understanding of this technique (:

Source Code

Below is some sample source code that implements Shamir’s Secret Sharing in C++.

I use 64 bit integers, but if you were going to be using this in a realistic situation you could very well overflow 64 bit ints and get the wrong answers. I hit this problem for instance when trying to require more than about 10 shares, using a prime of 257, and generating 50 shares. If you hit the limit of 64 bit ints you can use a multi precision math library instead to have virtually unlimited sized ints. The boost multiprecision header library is a decent choice for multi precision integers, specifically cpp_int.

#include <stdio.h>
#include <array>
#include <vector>
#include <math.h>
#include <random>
#include <assert.h>
#include <stdint.h>
#include <inttypes.h>

typedef int64_t TINT;
typedef std::array<TINT, 2> TShare;
typedef std::vector<TShare> TShares;

class CShamirSecretSharing
{
public:
    CShamirSecretSharing (size_t sharesNeeded, TINT prime)
        : c_sharesNeeded(sharesNeeded), c_prime(prime)
    {
        // There needs to be at least 1 share needed
        assert(sharesNeeded > 0);
    }

    // Generate N shares for a secretNumber
    TShares GenerateShares (TINT secretNumber, TINT numShares) const
    {
        // calculate our curve coefficients
        std::vector<TINT> coefficients;
        {
            // store the secret number as the first coefficient;
            coefficients.resize((size_t)c_sharesNeeded);
            coefficients[0] = secretNumber;

            // randomize the rest of the coefficients
            std::array<int, std::mt19937::state_size> seed_data;
            std::random_device r;
            std::generate_n(seed_data.data(), seed_data.size(), std::ref(r));
            std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
            std::mt19937 gen(seq);
            std::uniform_int_distribution<TINT> dis(1, c_prime - 1);
            for (TINT i = 1; i < c_sharesNeeded; ++i)
                coefficients[(size_t)i] = dis(gen);
        }

        // generate the shares
        TShares shares;
        shares.resize((size_t)numShares);
        for (size_t i = 0; i < numShares; ++i)
            shares[i] = GenerateShare(i + 1, coefficients);
        return shares;
    }

    // use lagrange polynomials to find f(0) of the curve, which is the secret number
    TINT JoinShares (const TShares& shares) const
    {
        // make sure there is at elast the minimum number of shares
        assert(shares.size() >= size_t(c_sharesNeeded));

        // Sigma summation loop
        TINT sum = 0;
        for (TINT j = 0; j < c_sharesNeeded; ++j)
        {
            TINT y_j = shares[(size_t)j][1];

            TINT numerator = 1;
            TINT denominator = 1;

            // Pi product loop
            for (TINT m = 0; m < c_sharesNeeded; ++m)
            {
                if (m == j)
                    continue;

                numerator = (numerator * shares[(size_t)m][0]) % c_prime;
                denominator = (denominator * (shares[(size_t)m][0] - shares[(size_t)j][0])) % c_prime;
            }

            sum = (c_prime + sum + y_j * numerator * modInverse(denominator, c_prime)) % c_prime;
        }
        return sum;
    }

    const TINT GetPrime () const { return c_prime; }
    const TINT GetSharesNeeded () const { return c_sharesNeeded; }

private:

    // Generate a single share in the form of (x, f(x))
    TShare GenerateShare (TINT x, const std::vector<TINT>& coefficients) const
    {
        TINT xpow = x;
        TINT y = coefficients[0];
        for (TINT i = 1; i < c_sharesNeeded; ++i) {
            y += coefficients[(size_t)i] * xpow;
            xpow *= x;
        }
        return{ x, y % c_prime };
    }

    // Gives the decomposition of the gcd of a and b.  Returns [x,y,z] such that x = gcd(a,b) and y*a + z*b = x
    static const std::array<TINT, 3> gcdD (TINT a, TINT b) {
        if (b == 0)
            return{ a, 1, 0 };

        const TINT n = a / b;
        const TINT c = a % b;
        const std::array<TINT, 3> r = gcdD(b, c);

        return{ r[0], r[2], r[1] - r[2] * n };
    }

    // Gives the multiplicative inverse of k mod prime.  In other words (k * modInverse(k)) % prime = 1 for all prime > k >= 1 
    static TINT modInverse (TINT k, TINT prime) {
        k = k % prime;
        TINT r = (k < 0) ? -gcdD(prime, -k)[2] : gcdD(prime, k)[2];
        return (prime + r) % prime;
    }

private:
    
    // Publically known information
    const TINT          c_prime;
    const TINT          c_sharesNeeded;
};

void WaitForEnter ()
{
    printf("Press Enter to quit");
    fflush(stdin);
    getchar();
}

int main (int argc, char **argv)
{
    // Parameters
    const TINT c_secretNumber = 435;
    const TINT c_sharesNeeded = 7;
    const TINT c_sharesMade = 50;
    const TINT c_prime = 439;   // must be a prime number larger than the other three numbers above

    // set up a secret sharing object with the public information
    CShamirSecretSharing secretSharer(c_sharesNeeded, c_prime);

    // split a secret value into multiple shares
    TShares shares = secretSharer.GenerateShares(c_secretNumber, c_sharesMade);

    // shuffle the shares, so it's random which ones are used to join
    std::array<int, std::mt19937::state_size> seed_data;
    std::random_device r;
    std::generate_n(seed_data.data(), seed_data.size(), std::ref(r));
    std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
    std::mt19937 gen(seq);
    std::shuffle(shares.begin(), shares.end(), gen);

    // join the shares
    TINT joinedSecret = secretSharer.JoinShares(shares);

    // show the public information and the secrets being joined
    printf("%" PRId64 " shares needed, %i shares made\n", secretSharer.GetSharesNeeded(), shares.size());
    printf("Prime = %" PRId64 "\n\n", secretSharer.GetPrime());
    for (TINT i = 0, c = secretSharer.GetSharesNeeded(); i < c; ++i)
        printf("Share %" PRId64 " = (%" PRId64 ", %" PRId64 ")\n", i+1, shares[i][0], shares[i][1]);

    // show the result
    printf("\nJoined Secret = %" PRId64 "\nActual Secret = %" PRId64 "\n\n", joinedSecret, c_secretNumber);
    assert(joinedSecret == c_secretNumber);
    WaitForEnter();
    return 0;
}

Example Output

Here is some example output of the program:

Links

Wikipedia: Shamir’s Secret Sharing (Note: for some reason the example javascript implementation here only worked for odd numbered keys required)
Wikipedia: Finite Field
Cryptography.wikia.com: Shamir’s Secret Sharing
Java Implementation of Shamir’s Secret Sharing (Note: I don’t think this implementation is correct, and neither is the one that someone posted to correct them!)

When writing this post I wondered if maybe you could use the coefficients of the other terms as secrets as well. These two links talk about the details of that:
Cryptography Stack Exchange: Why only one secret value with Shamir’s secret sharing?
Cryptography Stack Exchange: Coefficients in Shamir’s Secret Sharing Scheme

Now that you understand this, you are probably ready to start reading up on elliptic curve cryptography. Give this link below a read if you are interested in a gentle introduction on that!
A (Relatively Easy To Understand) Primer on Elliptic Curve Cryptography

Comments

comments