raytracing, raymarching, and other things. Click here: Demofox on Shadertoy

Shadertoy is a really fun website where you can write a pixel shader in WebGL / GLSL and share it with your friends.

raytracing, raymarching, and other things. Click here: Demofox on Shadertoy

Shadertoy is a really fun website where you can write a pixel shader in WebGL / GLSL and share it with your friends.

If you are making a game, chances are you are loading assets and/or data from disk, and that you would like to load it in as fast as possible, so that players can play your game instead of staring at loading screens.

To make the fastest load possible, you want as few disk reads as you can get away with, you want them to be as large reads as possible, you want to read in the same order as the data is on disk, and if possible, you want to read from a single file that is kept open the whole time.

There are many reasons for why the above makes for ideal disk reading, but some of them are…

- Every time you do a disk read, if the disk is in use by another process, you have to wait for that process to be done with the disk before you can start your read. The fewer disk reads you do, the fewer times you possibly have to wait on something else to relinquish the disk. Doing as large reads as possible makes for fewer disk reads too.
- You want to read the data from disk in the same order that it’s stored on disk, because for some drive types (such as CDROM, DVD and hard disk drives) there is a physical “read head” that the drive has to move around to get to the data. The more that drive head has to move around, the longer you spend waiting for it to move, versus getting the data you want from the disk drive. Some drives, such as SSDs, have a zero seek time, but reading sequentially can also help out due to more efficient use of disk buffers.
- You want to read the data from a single file when possible, versus having multiple files, because opening a file is expensive, and doing multiple reads from multiple files has a lot more erratic performance than doing multiple reads from a single file. Also, due to the fact that we usually have no way of knowing how files are actually laid out on disk, putting all the data into a single file is the only way to be sure that you can read data in order to minimize read head movements between reads (when the file is not fragmented on disk).

So, how do you go from having a bunch of small reads from many different files, to fewer (or just 1) read from a single file?

Professionally made video games (especially console games) often will come up with a “packing process” to put all the files together into a single bundle, and possibly compress or encrypt it. Also, they’ll use this “pack time” opportunity to pre-calculate whatever they can that might be expensive at load time. For instance creating a Navigation Mesh or baking Lightmaps.

That is pretty cool, but is quite a bit of work. What if you don’t have the opportunity or willingness to make such a thing, or you want to get a sense of how much something like that would improve things?

Well, Paul, a buddy of mine, told me about a neat technique for doing this quickly and easily that after having heard it, I see references to ALL THE TIME now. It’s bizarre.

Basically, what you do is give your file read functions a special mode of operation where whenever they read data from disk, they also append that data to a special file.

After the load process is complete, you will then have a file that contains all the data your game wanted to read from disk during loading, in the order that it asked for it.

Next, you give your read and seek functions a special mode of operation where read will read from that special file, and seek will do nothing. You can open that file at the beginning of the load operation and close it at the end.

Following that, you are now a lot closer to ideal… you are reading from a single file, and you are reading sequentially.

For the rest of it (reading as few times as possible / doing as large reads as you can), if you have the RAM free, instead of having your disk read function read from the file, you could actually just load the whole file into RAM at once when loading starts (with a single disk read), and have your read function just serve data from memory. When loading is finished, you can free that memory from RAM, OR if you want to make subsequent loads faster (like if your game can have multiple plays in one session), you can keep it in memory so that the subsequent loads don’t even have to touch the disk drive.

Of course this assumes you have a deterministic read order, or that you can make it deterministic, etc etc, but it’s a pretty useful tool for the toolbox IMO.

Ok so as it turns out, the windows file cache (SuperFetch) uses a variation of this technique as well. Check it out!

From SuperFetch: How it Works & Myths

Let’s focus on decreasing boot times first. During the Windows boot process, the same files need to be accessed at different times. SuperFetch records which data and files need to be accessed at which times, and stores this data in a trace file. During subsequent boots, this information is used to make the loading of said data/files more efficient, resulting in shorter boot times.

SuperFetch performs more tasks to make the boot process more efficient. It also interacts with the defragmenter to make sure that the files accessed during the boot process are stored on the disk in the order they are accessed in. It performs this as a routine task every three days; the specific file layout is stored in /Windows/Prefetch/Layout.ini.

Here are some interesting tidbits of info that I’ve found useful for being able to think about matrix math in a more intuitive way. We start off with 2d matrix math but extend it to 3d at the end.

You might ask why you might go through all the trouble of using a matrix for doing transformations like translation, rotation, scaling and shearing / skewing.

Why not just manually modify the points, putting them through equations to get the results. Well, there are two main reasons.

The first one is for performance. The function for rotating a point in 2d looks like this:

x’ = x * cos(theta) – y * sin(theta)

y’ = x * sin(theta) + y * cos(theta)

If you have 1000 points, that means you are calculating sin twice and cos twice for each point which is 4000 trig operations. If you are smart (or your compiler is!), you’ll only do sin and cos once for each point, but that’s still 2000 trig operations.

If you are super duper smart (or your compiler is…), you’ll notice that theta is the same for all 1000 points, and perhaps you’ll calculate sin(theta) and cos(theta) once ahead of time and use those values for each point.

That last step is basically what matrix math does for you. A 2d rotation matrix looks like the below:

[ cos(theta), sin(theta) ]

[-sin(theta), cos(theta) ]

That means that once you have calculated your rotation matrix, you don’t need to keep performing trig operations. You have your values and can use them over and over very cheaply.

This especially saves processor time when you combine multiple transformations together. If you needed to perform an operation that did some crazy combination of 2000 rotations, 2000 translations and 2000 scale adjustments, instead of needing to do those 6000 operations on each point, you can just calculate the final matrix (by combining the matrix of each of those 6000 operations into one matrix) and then use that single matrix to your hearts content.

Another reason why you might want to use matrices instead of doing transforms by hand is that it’s a lot simpler writing code that does general transformations instead of deciding after the fact “hey i want to add scaling now and i need to touch all my transformation related code to implement it”.

Using a matrix, you don’t have to know or care what the transform is, it will just do it and you can move on with your life.

As a quick refresher, in 2d, a dot product is just 2 multiplies and an add (A.X * B.X + A.Y * B.Y) and in 3d, a dot product is just 3 multiplies and 2 adds (A.X * B.X + A.Y * B.Y + A.Z * B.Z).

A lot of modern hardware (both CPU and GPU) has varying amounts of support built in for vector and matrix math to make it faster, but matrix vs vector multiplies are really not that bad to begin with. In 2d, a matrix * vector operation is just 2 dot products! In 3d, it’s 3 dot products. Look at the below to see what I mean:

[VX VY]

*

[AX AY]

[BX BY]

=

VX’ = AX * VX + BX * VY

VY’ = AY * VX + BY * VY

and in 3d:

[VX VY VZ]

*

[AX AY AZ]

[BX BY BZ]

[CX CY CZ]

=

VX’ = AX * VX + BX * VY + CX * VZ

VY’ = AY * VX + BY * VY + CY * VZ

VZ’ = AZ * VX + BZ * VY + CZ * VZ

Let’s say that you are working in 2d and you want to rotate a point. Let’s say that for some reason you know what the rotated X and Y axis are supposed to be. You can actually create a rotation matrix from that knowledge alone without having to do any trig or geometry type math.

Like for instance, if you wanted an object’s x axis to point parallel to the vector of a moving object, and you knew the object’s normalized velocity vector (the direction it is moving, normalized to have a vector length of 1) was (0.34, 0.93).

That normalized velocity vector would be your X axis, and you could use the “faked 2d cross product” of flipping x and y and negating one to get a vector perpendicular to that X axis (aka your Y axis) . So…

X axis = ( 0.34, 0.93)

Y axis = (-0.93, 0.34)

(note, which one you negate on the Y axis matters only as much as you care which of the 2 directions you want the Y axis to point. Basically if you don’t care if it points up or down, it doesn’t matter which one you flip. If you do care, you need to pick the right one for the direction you want. This may not be the best way, but i do it by visual inspection, or by evaluating the math and seeing if it’s pointing in the way that i want or not. In other words… try it one way, and if it’s backwards, do it the other way.)

Now that we have an X and a Y axis, we just use the X axis as the first row, and the Y axis as the second row and get our rotation matrix:

[ 0.34, 0.93]

[-0.93, 0.34]

To see that it really works, try multiplying the vector (1,0) by that matrix to see if we get the right number out (it should be the same vector as the velocity of the object we are orienting to). We are basically verifying here that our X axis comes out to what it should.

[1 0]

*

[ 0.34, 0.93]

[-0.93, 0.34]

=

X’ = 0.34 * 1 – 0.93 * 0 = 0.34

Y’ = 0.93 * 1 + 0.34 * 0 = 0.93

now, let’s check our y axis

[0 1]

*

[ 0.34, 0.93]

[-0.93, 0.34]

=

X’ = 0.34 * 0 – 0.93 * 1 = -0.93

Y’ = 0.93 * 0 + 0.34 * 1 = 0.34

Note that when you put your X and Y axis basis vectors into the matrix, that they should be normalized, otherwise they will do strange things to your point – like introduce scaling and skewing.

The process works backwards too which is really handy. If you have some matrix of unknown rotation, you can get the basis vectors out the same way you put them in.

You might have already seen this when looking at the matrix from the last example

[ 0.34, 0.93]

[-0.93, 0.34]

The first row (0.34, 0.93) is the X axis, and the second row (-0.93, 0.34) is the Y axis.

One caveat to be aware of though is that if you are working with a “matrix from the wild” where you don’t know if it’s a rotation matrix only, or if it might have some other transforms in it (scaling or skewing), the rows may not be normalized.

If you know for sure that it’s just a rotation matrix, you can take the basis vectors right out of the matrix. If you don’t know for sure, you need to normalize the basis vectors after you pull them out.

Why is this useful?

If in 3d, you had the matrix representing the camera transform, you could grab the 3rd row to get the forward vector. You could use this vector when launching a projectile from the player’s position so that it would go where they were aiming.

Again, in 3d if you had the camera matrix, you could grab the first row to get the “left vector” and you could add or subtract that from the player’s position to do strafing left and right.

No complex math required (:

To be able to have a matrix that can do translation, you need to go to a 3×3 matrix.

Below is what a translation matrix looks like. (TX,TY) is the translation.

[1 0 0]

[0 1 0]

[TX TY 1]

When you want to transform a 2d point by a 3×3 matrix like the above, you need to use a 1 for the Z component. Let’s see what happens when we transform a 2d point by this 3×3 translation matrix.

[X Y 1]

*

[1 0 0]

[0 1 0]

[TX TY 1]

=

X’ = X * 1 + Y * 0 + 1 * TX

Y’ = Y * 0 + Y * 1 + 1 * TY

Z’ = 1 * 0 + 1 * 0 + 1 * 1

=

X’ = X + TX

Y’ = Y * TY

Z’ = 1

If you want to transform a 2d VECTOR (something that represents a direction, not a location) by a 3×3 matrix, you need to use a zero in the Z component instead of a 1. You may have heard this before, but let’s see why:

[X Y 0]

*

[1 0 0]

[0 1 0]

[TX TY 1]

=

X’ = X * 1 + Y * 0 + 0 * TX

Y’ = Y * 0 + Y * 1 + 0 * TY

Z’ = 1 * 0 + 1 * 0 + 0 * 1

=

X’ = X

Y’ = Y

Z’ = 0

As you can see, the vector was not affected by the translation of the matrix. If the 3×3 matrix had scaling and rotation in it as well as the translation, the vector WOULD be affected by those things as it should be. The translation is the only thing that doesn’t apply when you use a Z value of zero.

Note that if you want to get the 2d basis vectors from a 3×3 matrix, you just ignore the 3rd row and the 3rd column and do the same thing you would do with a 2×2 matrix. Again, making sure to normalize the basis vectors when it’s appropriate.

To combine transforms together (whether 2×2 or 3×3 matrices) you just multiply them together.

The order of matrix multiplication matters though. A * B is not the same as B * A.

To make things confusing, OpenGL and DirectX use different representations of matrices (one is “column major” the other is “row major”) which means that the matrices in each API are transposes of the other.

To make things even more confusing, if AT and BT are the transpose of A and B, then A * B = BT * BA. This means that premultiplication and postmultiplication (aka is A on the left or the right in A * B) swap meanings when going from one API to the other.

I’m not sure who is to blame for that one, but here’s an interesting link on the subject: http://steve.hollasch.net/cgindex/math/matrix/column-vec.html

Note that row vs column major matrices also change which direction the basis vectors are stored in… so instead of X axis being the 3 top numbers, it would be the 3 left numbers! It should be easy enough to tell which is which by inspection, but keep it in mind!

Anyways, I’ll continue to use the conventions set in this article above in the vector / matrix multiplication.

If you multiply a translation matrix by a rotation matrix, you’ll get a matrix that rotates a point, and then translates it.

[1 0 3]

[0 1 5]

[0 0 1]

*

[ 0.34 0.93 0]

[-0.93 0.34 0]

[ 0 0 1]

=

[ 0.34 0.93 3]

[-0.93 0.34 5]

[ 0 0 1]

If, however, you multiply a rotation matrix by a translation matrix, you’ll get a matrix that translates a point, then rotates it. Going that direction, the translated point is rotated.

[ 0.34 0.93 0]

[-0.93 0.34 0]

[ 0 0 1]

*

[1 0 3]

[0 1 5]

[0 0 1]

=

[ 0.34 0.93 5]

[-0.93 0.34 1.4]

[ 0 0 1 ]

Which way you multiply entirely depends on what it is you are trying to achieve. And, well… it also depends on whether you are dealing with row major or column major matrices!

Multiplying a 3×3 matrix by another 3×3 matrix is the same as doing nine 3d dot products.

Taking the transpose of a matrix doesn’t have any intuitive geometrical (or other) meaning that I’m aware of. I’ve looked on the net and all I could find was some “simple” explanations involving general relativity. Awesome right? LOL.

On the other hand, inverse matrices have a very intuitive and very useful meaning. Inverse matrices do the reverse of whatever the matrix does.

That means if you have a matrix that translates by (7,5) and then rotates by (45, 30) degrees, the inverse matrix will rotate by (-45, -30) degrees and then translate by (-7,-5).

This is super useful sometimes (:

Inverting a 2×2 matrix is actually really easy. I could explain it but you really ought to check out this page to see how. I recommend doing the exercises at the bottom to make sure you firmly understand how to do it!

http://www.mathsisfun.com/algebra/matrix-inverse.html

Inverting a 3×3 matrix is fairly easy too, but kind of tedious. Here’s a page that explains how:

http://www.wikihow.com/Inverse-a-3X3-Matrix

After you are done with that, here are some problems to run through to make sure you really do understand it:

https://www.khanacademy.org/math/algebra/algebra-matrices/inverting_matrices/e/matrix_inverse_3x3

Not all matrices are invertible. If you read the links and walk through the exercises, you’ll see why. Basically, an uninvertable matrix will cause a divide by zero in the inversion process. I believe this comes up when you have a matrix with parallel basis vectors, or if you have a zero scaling matrix (a multiply by zero can’t be reversed!) but i could be mistaken or not have the full picture there. If I’ve mis-spoken, post a comment!

Extending the above to 3d matrices is pretty simple and there aren’t really many suprises.

One difference is that in the above where we use a 2×2 matrix in 2d, we would use a 3×3 matrix in 3d because of the extra Z coordinate.

If you want to do translation in 3d, you have to use a 4v4 matrix instead of a 3×3 (just like in 2d how you had to move from a 2×2 to a 3v3, in 3d you have to move from a 3×3 to a 4×4). A 3d translation matrix looks like this:

[1 0 0 0]

[0 1 0 0]

[0 0 1 0]

[TX TY TZ 1]

Inverting a 4×4 matrix is pretty tedious, but follows the same patterns as inverting a 3×3 and a 2×2.

You can find info on a 3d rotation matrix here, if you don’t want to build it up with basis vectors: http://en.wikipedia.org/wiki/Rotation_matrix#In_three_dimensions

In our example above where we had the X axis and used a “fake 2d cross product” to get the perpendicular vector, when you move into 3d you’ll probably want to use the cross product to get perpendicular vectors.

Like for instance if you know 2 of the basis vectors, you can use cross product of those 2 to get the third.

Z = X x Y

If, however, you only have one basic vector (say “Z” because maybe you have a “camera direction”), you can use cross product to get 2 other vectors so long as you can make certain assumptions about the orientations involved. Like for instance, you might do this:

Fwd = normalize(Camera.Forward)

Left = normalize(Fwd x (0, 1, 0))

Up = normalize(Fwd x Left)

rotation matrix =

[Left.X Left.Y Left.Z]

[Up.X Up.Y Up.Z]

[Fwd.X Fwd.Y Fwd.Z]

The above only works if the camera can never look straight up, and it also assumes that your camera doesn’t have any roll – but it is a useful technique if those assumptions are ok.

That’s about it! I hope you found at least some of this information useful (:

if I missed anything you think belongs here, post a comment and share with the rest of us!

As a game developer there is just too much darn stuff to learn. You could spend your entire life learning things and never know it all.

It’s a blessing in that you are seldom bored, but also sometimes a curse in that there almost always is a better way to do something, and that you would know about it if you had spent your time learning X instead of Y 😛

I find that you sort of have to triage what you learn and what you choose to keep fresh in your mind, which can be a challenge sometimes. If you can find the commonalities between things that can help some – like understanding how encryption, hashing, pseudo random number generators and chaos theory all overlap – or how skeletal animation blending and audio synthesis are both trying to be continuous waves above all else. Also, if you put in the time investment to learn something to where it becomes intuitive, that frees up neurons to make room for other stuff. But, of course, we have a finite amount of time, so can’t always spend the time needed to get to that level on every single topic.

How about you… do you have to do a juggling act like this to keep sharp and stay effective as a game (or non-game) programmer? I’d be interested to hear how others deal with this sort of thing with such a large knowledge space that we work in.

In any case, I usually work with spherical or polar coordinates only on rare occasions, so whenever i do, the process usually is to google the equations, drop them in, and move on with my life. I was recently implementing an orbit camera for a raytracer on shadertoy.com (Raytraced Refraction) and when my copy/pasting wasn’t working, I was forced to take a deeper look at why it wasn’t working. Amazingly, this time around, it finally clicked and is now an intuitive thing so I figured I’d share the explanation that makes most sense to me in case it helps anyone else.

Polar coordinates have two components – a distance and an angle – and represent a point in 2d space.

The distance is called the radial coordinate, or the radius and the angle is called the angular coordinate or polar angle.

Just like you probably expect, the angle defines what direction the point is in, and the radius defines how far away it is. Super simple. Below is a picture of a polar coordinate point at (3, 45) where 3 is the distance and 45 is the angle.

So how do we convert that to rectangular coordinates? well, first thing to do is to convert the angle to rectangular coordinates on a unit circle to get a direction vector. Then, you multiply that direction vector by the radius to get the final coordinate.

To convert the angle to a point on a unit circle and get the direction vector it’s super simple…

X = cos(angle) Y = sin(angle)

For every point on the unit circle, it’s X coordinate is the cosine of the angle, and it’s Y coordinate is the sine of the angle.

Looking at the diagram below, see if you can figure out why arccosine only returns an angle between 0 and 180, and why arcsine only returns an angle between -90 and 90 (hint, what if i asked you to tell me what angle gives 0.7 in the x component). Also see if you can understand why sin(x)^2 + cos(x)^2 = 1 (hint: distance formula).

Ok so now that we can get our direction vector, we just need to multiply it by the radius. So… to convert from polar to rectangular (cartesian) coordinates, you do this:

X = cos(angle) * radius Y = sin(angle) * radius

So how do we convert from rectangular coordinates to polar?

Well, since we have the X and the Y coordinates, and we know that tangent(angle) = Y / X, we can use arctangent to get our angle. Unfortunately atan has a similar problem to asin and acos in that it doesn’t know which quadrant you are talking about. For instance, look at the diagram above again and tell me “which angle gives me a value of 1 when i divide the Y component by the X component?”. The answer is 45 degrees and 225 degrees both. This is because with a positive value of 1, we don’t know if X and Y were both negative, or if they were both positive. Similarly, if i asked which angle gave an answer of -1, we wouldn’t know if it was the X or Y that was negative.

Instead of using atan, we want to use atan2, which takes 2 parameters – Y and X – so that it can figure out the correct angle for you.

Next is the easy part of finding the radius. treating your point as a vector (or continuing to treat it like a vector if it IS a vector), the radius is just the magnitude of the vector (or distance from the origin if you want to think of it in “point” terms instead of vectors).

So, converting rectangular to polar coordinates is done like this:

radius = sqrt(X * X + Y * Y) angle = atan2(Y, X)

Spherical coordinates have the same components as polar coordinates, but then an added component: an angle which determines pitch / vertical rotation (think: looking up and looking down, instead of the polar angle which is in charge of looking left and right).

In math, they usually call the radius rho, the polar angle theta, and the azimuth angle phi, so a formal polar coordinate looks like this:

(rho, theta, phi)

For our examples let’s assume that X and Y make up the horizontal plane and that Z is the vertical (3d) axis.

If you are scared to make the jump from 2D polar coordinates to 3D spherical coordinates don’t be! The way to deal with these guys is to break the 3d problem into two 2d problems, using the exact same stuff as described above.

So, the first thing we want to do is completely ignore this new 3rd component phi and think back to our 2d case. We are also going to ignore the radius for now.

XTheta = cos(theta) YTheta = sin(theta)

This is our direction vector on the horizontal plane (same as the 2d case, not accounting for radius yet).

Next we want to pretend like we are looking at our 3d world from the side and use our phi angle to convert from polar to rectangular coordinates:

XPhi = cos(phi) YPhi = sin(phi)

One way to think of what this other angle phi means, is that it is controlling where in the unit sphere the theta circle sits. As the theta circle gets higher or lower on the sphere, it shrinks or grows. It’s only at zero angle that it has a radius of 1.0. So, calculating these values, YPhi represents how high on the sphere the theta circle should sit, and XPhi is how large the circle should be.

So, to combine the X,Y theta values and the X,Y Phi values, we use YPhi as the vertical component, and XPhi as a radius for the theta circle, which we can do like this:

X = XTheta * XPhi Y = YTheta * XPhi Z = YPhi

The above equation will give us a point on a unit sphere, so from here, we need to multiply in the radius and our equation becomes:

X = XTheta * XPhi * radius Y = YTheta * XPhi * radius Z = YPhi * radius

If we put sin and cos back in, instead of xtheta (etc), the equation becomes that familiar, and previously complex equation:

X = cos(theta) * cos(phi) * radius Y = sin(theta) * cos(phi) * radius Z = sin(phi) * radius

Hopefully the equation makes more sense now, and hopefully you can look at that and intuitively understand why those values are what they are.

To convert from spherical coordinates to rectangular, the first thing to do is to get the radius, which is done in the exact same way as in 2d. We just take the magnitude of the vector (aka the distance of the point from the origion) and we are done.

To get theta and phi, we do the same thing of separating this 3d problem into two 2d problems.

In fact, to get theta, we do the exact same thing as we do for polar coordinates! We use atan2(Y,X) to get our angle… THAT’S ALL!

So, we have this so far:

radius = sqrt(X * X + Y * Y + Z * Z) theta = atan2(Y, X)

How do we figure out phi? Well, if you said that we should do atan2(Z,Y) or atan2(Z,X) you were pretty close but it’s actually arccos(Z / radius).

The reason for this is because neither X, nor Y is the “X” component of the “phi 2d polar coordinate”. you’d have to take the length of the (X,Y) vector and use that if you wanted to use atan2 to calculate phi. Instead of calculating that vector length, we can instead use a value we already have. cosine is Y / hypotenuse length, and hypotenuse length is the radius (length of our vector), so we might as well use that radius we already have to be able to use arccos.

The final equations for converting rectangular to spherical are:

radius = sqrt(X * X + Y * Y + Z * Z) theta = atan2(Y, X) phi = acos(Z / radius)

More info / alternate forms available on wikipedia here:

From Cartesian to Spherical Coordinates

Spherical Coordinate System

At first i said the answer was to check this out: Diehard Battery of Tests of Randomness which is linked to by this page which may also be of interest: Tests for Random Number Generators.

But apparently that is the “old way” and there is a new program from NIST that you can get here: NIST test suite for random numbers, which subsequently is linked to from random.org: Random.org Statistical Analysis.

Getting that program from NIST to compile was a little bit of a chore for me on msvc 2010. The biggest hurdle i hit was that msvc 2010 doesnt have erf() and erfc() so i had to google “erf.cpp” and find an implementation. If you can’t find one, erf and erfc are part of gcc which is open sourced so you can always go that route if you need to!

After compiling, i was able to run the test on my numbers but couldn’t make much sense of the results very easily. There were a few p scores and presumably some chi squared scores somewhere, but the “summary file” was very cryptic (pun intended) so i wasn’t really sure…

Anyways, just wanted to put it here for myself and others if anyone’s looking for this in the future 😛

Thanks to my buddy James for the correction and links to the newer NIST program. Thanks man!

Interestingly, the tests above use the source number data to do a bunch of different things, and then measure the statistics of the results.

For instance, it will use the numbers to shuffle a deck of cards, and then it will play poker and see if there is any bias of cards dealt, or players winning.

Or, it will use the numbers as the source of numbers for a roulette wheel and see if players win at the right rate statistically.

I guess the bottom line lesson for testing random numbers is that you should use the numbers how you intend to use them, and see if there’s any statistical anomalies.

There doesn’t seem to be a magic bullet test that works for generic randomness, but I’m guessing it’s just sort of… check for patterns in every way you can, or every way you care about, and if you don’t find any, consider it random. If you are using it for purposes where randomness really matters – like security or gambling – you then hope nobody else finds a pattern you didn’t! 😛

On that topic, check this out: Wikipedia: Michael Larson

A buddy at work said “I wish C++ had a two pass pre processor so that we could do external header guards”. It got me thinking about some random macro stuff i had seen before and i thought “hrm… you know, that actually might be possible to do… i’m going to give it a try”.

I ended up working something up tonight at home that’s semi-palatable. The way you use it is a little weird, but i think it satisfies the spirit of the challenge, and works as a proof of concept that you can do external header guards without having to type a bunch of stuff.

If you can think of a way to improve it, post a comment or something and let me know!!

Have you ever seen something like the below? it’s called a header guard:

#ifndef BLAH_H #define BLAH_H // code goes here #endif BLAH_H

You might also have seen this variation:

#pragma once

Without the header guards, if you include the header file twice, it will complain that the classes etc have already been defined. Those make sure that doesn’t happen, by only including the contents of the file if it hasn’t already been included.

External header guards on the other hand would be guards in the place it’s included instead of in the header file itself. That is more typing (more work), but the benefit there is that the compiler doesn’t have to open the header file at all to see if it’s already been included, which could make for faster compile times in large projects.

Anyways, here’s the code:

// include testoneblah.h which defines the typedef ProofIncluded__blah_h // to prove that it was really included #define FILESEQ (testone)(blah) #include "Includer.h" // try to include testoneblah.h again. It won't get included again, and // instead, ProofIncludeBlocked__blah_h will get typedef'd by includer.h // to prove that the file did not get included. Comment these lines out // and you'll get a compiler error that ProofIncludeBlocked__blah_h is // an undeclared identifier #define FILESEQ (testone)(blah) #include "Includer.h" int main(int argc, char **argv) { ProofIncluded__blah_h a; ProofIncludeBlocked__blah_h b; return 0; }

So it is a little weird… but to include a file, you define FILESEQ with a directory and filename (without .h on it), and then include “Includer.h”. Even though it’s weird to use, and doesn’t work for .inl files (and maybe other issues, some easily solved), it’s only one extra line of typing to do an external header guard, which is about as good as you can expect.

Ideally I wish the interface were like the below, but I haven’t been able to figure out how to make that work unfortunately.

IncludeFile_((testone)(blah))

//===================================================================================================== // Rip off boost, hooray!! boost_pp is really nice, you can just grab it from the boost bundle and // start using it because it's just a bunch of includes. you don't need to build or link with boost // at all. It's really nice. http://www.boost.org/ //===================================================================================================== # define BOOST_PP_EMPTY() # define BOOST_PP_SEQ_ELEM(i, seq) BOOST_PP_SEQ_ELEM_I(i, seq) # define BOOST_PP_SEQ_ELEM_I(i, seq) BOOST_PP_SEQ_ELEM_II((BOOST_PP_SEQ_ELEM_ ## i seq)) # define BOOST_PP_SEQ_ELEM_II(res) BOOST_PP_SEQ_ELEM_IV(BOOST_PP_SEQ_ELEM_III res) # define BOOST_PP_SEQ_ELEM_III(x, _) x BOOST_PP_EMPTY() # define BOOST_PP_SEQ_ELEM_IV(x) x # define BOOST_PP_SEQ_ELEM_0(x) x, BOOST_PP_NIL # define BOOST_PP_SEQ_ELEM_1(_) BOOST_PP_SEQ_ELEM_0 //===================================================================================================== #define EB_COMBINETEXT(a, b) EB_COMBINETEXT_INTERNAL(a, b) #define EB_COMBINETEXT_INTERNAL(a, b) a ## b #define EB_TOTEXT(a) EB_TOTEXT_INTERNAL (a) #define EB_TOTEXT_INTERNAL(a) #a //===================================================================================================== // extract the directory and file name #define DIR BOOST_PP_SEQ_ELEM(0, FILESEQ) #define FILE BOOST_PP_SEQ_ELEM(1, FILESEQ) // create the full file name: /.h #define THEFILENAME EB_TOTEXT(EB_COMBINETEXT(DIR, EB_COMBINETEXT(/, EB_COMBINETEXT(FILE, .h)))) #if !EB_COMBINETEXT(__, EB_COMBINETEXT(FILE, _h)) //including file: YES //include the file #include THEFILENAME #else //including file: NO //create a typedef called ProofIncludeBlocked___H to prove we didn't include the file typedef char EB_COMBINETEXT(ProofIncludeBlocked__, EB_COMBINETEXT(FILE, _h)); #endif // clean up the things we created. boost macros can stick around ::shrug:: #undef THEFILENAME #undef DIR #undef FILE // defined by caller, but we're cleaning it up for convenience #undef FILESEQ

I ripped some macros out of the boost preprocessor library (boost_pp) to help things out a little bit. In a nutshell what we are doing is this…

We test to see if the preprocessor value __<File>_h is not true (false, or undefined). If that is the case, we include the file <Directory>/<File>.h. Else, we define a typedef ProofIncludeBlocked<File>_h to prove that we blocked the include from happening.

#define __blah_h 1 class ProofIncluded__blah_h {};

Blah.h defines __blah_h as 1 (true). It’s important that it uses the same naming convention as Includer.h does (__<File>_h), otherwise this setup won’t work. If you screw it up, you’ll get compile errors about multiply defined symbols.

This file also defines a class ProofIncluded__blah_h to prove that this file actually got included, and also defines something that will complain if the file is included twice.

So, this is just a proof of concept and it has some issue including…

- Duplicate file names – if you have the same file name in different folders, this setup has issues. It might be able to be helped by including the directory name into the header guard preprocessor symbol.
- Referencing the same file different ways – if you reference the same file in different ways because there’s multiple ways to reach it in the include search paths, it won’t be able to tell that it’s the same file if you do the last fix. Maybe the real solution is to have another parameter defined specifying the header guard symbol? dont know…
- Only supports .h files – It assumes a .h extension, but maybe another parameter could be the file extension to use so you could include .inl, .hpp etc.

Hopefully you find it interesting at least though (:

Poday made some good points in the comments about it not being worth it, and my friend Doug also has this to say:

It’s not needed though all compilers already do that:

(GCC):

The GNU C preprocessor is programmed to notice when a header file uses this particular construct and handle it efficiently. If a header file is contained entirely in a `#ifndef’ conditional, then it records that fact. If a subsequent `#include’ specifies the same file, and the macro in the `#ifndef’ is already defined, then the file is entirely skipped, without even reading it.(Clang)

The MultipleIncludeOpt class implements a really simple little state machine that is used to detect the standard “#ifndef XX / #define XX” idiom that people typically use to prevent multiple inclusion of headers. If a buffer uses this idiom and is subsequently #include‘d, the preprocessor can simply check to see whether the guarding condition is defined or not. If so, the preprocessor can completely ignore the include of the header.(Clang still)

clang_isFileMultipleIncludeGuarded – Determine whether the given header is guarded against multiple inclusions, either with the conventional #ifndef/#define/#endif macro guards or with #pragma once.For MSVC all I could find is Herb Sutter lead architect for MSVC and head of the C++ committee in his book ‘C++ Coding Standards: 101 Rules, Guidelines, and Best Practices’:

24. Always write internal #include guards. Never write external #include guards.

With a reason of:

Don’t try to be clever: Don’t put any code or comments before and after the guarded portion, and stick to the standard form as shown.

Today’s preprocessors can detect include guards, but they might have limited intelligence and expect the guard code to appear exactly at the beginning and end of the header.

If you are a C/C++ programmer, you are likely familiar with malloc() and free(), the predecessors to C++’s new and delete operators, as well as the existence of the variations of malloc such as calloc, realloc and alloca.

If you are like me, you probably thought for a long while that malloc and it’s variations were relics of days gone by, only actually useful in a few very limited situations. Some of these guys still have use though, and don’t really have equivalents in C++ to replace them.

First the boring ones…

malloc – Allocates memory. Precursor to new operator.

calloc – Allocates memory and sets the contents to zero. C’s answer to the problem of uninitialized memory that constructors solve in C++.

Now the more interesting ones!

Believe it or not, alloca actually allocates memory on the stack. When your function goes out of scope, the stack memory is automatically returned to the stack due to the nature of how the stack and stack pointer work. No need to free the memory allocated with alloca, and in fact if you tried, you’d probably get a crash 😛

If you are a programmer who writes high performance applications, you are probably familiar with the benefits of using the stack instead of allocating memory on the heap with new or malloc.

The benefits of using the stack include…

- Quicker allocations – Allocating memory can be a relatively costly operation in terms of time, especially if you have multiple threads running using the same (thread safe) allocator. Allocating memory on the stack is essentially the same cost as defining a local variable. Under the hood, it’s just moving the stack pointer a little farther and gives you that empty space to use.
- No memory leaks – when the function you’ve allocated the stack memory in exits, that memory is automatically freed. This is because the stack pointer just “moves back” to what it used to be. There is not really any memory to free.
- Less memory fragmentation – When mixing large and small memory allocations and frees, sometimes you end up with your memory in a state where there is a lot of memory free, but just not all together in one place. For instance, your program might need to allocate 50MB, and there may be 300MB free on the heap total, but if there are small 16 byte allocations littered in the memory every 10MB, your program won’t be able to find a single 50MB region to allocate and the allocation will fail. One common cause of this problem is small allocations used for things like relatively small arrays or small string buffer allocations that exist temporarily to copy or transform some data, but are not meant to stick around very long. If you can put these on the stack instead of the heap, those small allocations don’t hit the heap, and your memory will be less fragmented in the end.
- Increased performance (fewer cache misses) – the contents of the stack are likely already in the CPU cache, so putting your data there means less information for the CPU to have to gather from RAM which is a slow operation.

However, there are some dangers when allocating memory on the stack as well

- If you keep a pointer to the memory, that memory could be “freed” and re-used, getting filled with other random data (local variables). That can cause crashes, memory corruption or other strange program behavior.
- If you allocate too much on the stack you could run out of stack space. The stack isn’t really meant to hold large amounts of allocated data. You can adjust your programs stack size though if this is a route you want to pursue.

There are some common techniques I’ve seen people use in places that could have also used alloca instead. These include…

- Small Pool Allocators – To get around the memory fragmentation problem, sometimes people will have different memory allocators based on the size of memory being allocated. This way, small temporary allocations for things like temporary string buffers will all be allocated from one place, while larger allocations for things like textures will be allocated elsewhere. This dramatically improves the memory fragmentation issue.
- Object Pools – Object pools are similar to small pool allocators but they work by allocating some amount of memory for specific types of objects, and have a way to remember which objects are used and which ones are free. For instance, you may dynamically allocate an array of 100 SMyStruct objects and have a flag for each to know which ones are in use and which ones aren’t. This way, the program can ask for a new object, and it can find one currently not in use and return it to the caller without needing to hit the ACTUAL memory allocator to get the data (unless all objects are spoken for, at which point it can choose to fail, or allocate a new “page” of objects to be able to hand out). This also has an interesting side effect that cache misses can drop quite a bit since the same kinds of objects will be nearer to eachother in memory.
- DIY Stack Allocator – When I was working at Midway, a friend (Hi Shawn!) profiled the animation code and found that a lot of time was spent in allocating temporary buffers to blend bone data together. To fix this, he rolled his own stack allocator, where there was one contiguous piece of memory on the heap that could be allocated from. There was an internal index keeping track of where the “top of the stack” was, and when memory was allocated, that stack index would just move up by however many bytes were asked for. At the end of the frame, the stack index was reset to zero, thus “freeing” the memory. This dramatically improved the animation system performance by making the temporary bone blend buffer allocations essentially free.
- Thread Specific Memory – If you are having problems where multiple threads are trying to allocate memory at the same time, causing contention and slowdowns due to thread synchronization, another option is to give each thread it’s own chunk of memory and let it allocate from that. That way there is no contention and you won’t have the slowdown of thread synchronization due to memory allocation anymore. A problem here though can be figuring out how much memory each thread needs. One thread may need a lot of memory, and another thread may need none, and you may not have any way of knowing which in advance. In this case, you’d have to allocate “a lot” of memory for each thread in advance, and pay an extra cost in memory that you technically don’t have to. But hey, at least it’s fast, maybe the trade off is worth it in your situation!

Lastly, there’s another common trick to avoid dynamic allocations involving templates, check it out!

// define the CStaticArray class template class CStaticArray { public: T m_values[N]; // you could put functions in here to do operations on the array data to make it look more like a standard // data type, instead of a plain vanilla array unsigned int Count () { return N; } void SomeOtherFunction () { } }; void MyFunc () { // make an array of 32 floats CStaticArray m_floatArray; // make an array of 128 SSomeStructs CStaticArray m_objectArray; for (unsigned int index = 0; index < m_objectArray.Count(); ++index) { m_objectArray.m_values[index].DoSomething(); } }

The above really shines if you have a standard API for strings or dynamic arrays in your code base. You can make a version like the above which works without dynamic allocations, but gives the same interface so it's easier for fellow programmers to use and swap in and out as needed.

Another nice benefit to the above technique is that it works for stack allocations, but you can also make them member variables of other objects. In this way, you can minimize dynamic allocations. Instead of having to dynamically allocate an object, and then dynamically allocate the array inside of it, you do a single allocation to get all the memory needed.

That is the closest thing in C++ that I've seen to alloca, but even so, alloca has the advantage that you can decide how much memory to allocate at run time. With the template method, you have to know at compile time which is fine for a lot of cases, but othertimes is a deal breaker, forcing you to have to go back to dynamic allocations (or perhaps now, alloca instead?)

Realloc is the other interesting memory allocation function.

Like I was mentioning above, the fewer allocations you can do, the better off you are in terms of performance, and also memory fragmentation.

By writing smart containers (dynamic arrays, dynamic strings, etc) you can make it so when someone tries to make a container smaller, that instead of allocating new memory that’s smaller, copying the data over, and freeing the old memory, that instead it just remembers the new size but keeps the old, larger memory around.

Then later on, if the container was told to grow, if it was smaller than the larger size from the past, it could just use some of that old memory again.

However, if that container grows larger than it used to be, you are going to have to allocate, copy, and free (costly etc) to grow the container.

Down in the guts of your computer however, there may be memory right after the current memory that’s not being used by anything else. Wouldn’t it be great if you could just say “hey… use that memory too, i don’t want to reallocate!”.

Well, realloc does ALL of the above for you without you having to write special code.

When you realloc memory, you give the old pointer and the new size, and if it’s able to, it won’t do any allocations whatsoever, and will just return you your old pointer back to you. It may allocate the next memory block for you if the new size is larger, but would still return the old pointer value in this case. Or, if the new amount of memory is smaller, it may return you back the same memory without doing anything internally (it depends on your compiler’s specific implementation of realloc what it does when)

If realloc does have to allocate new memory though, it will copy over all the old data to the new memory that it returns to you and free the old memory. So, you don’t have to CARE whether the pointer returned is old or new, just store the return value and continue on with your life.

It’s pretty cool and can help reduce actual memory allocations, lowering memory fragmentation and increasing performance.

In the first part of this blog post (Is pre-increment really faster than post increment? Part 1) I showed that it really doesn’t seem to matter if you use post or pre increment with simple integer types.

I then promised an example of where the choice of pre or post increment DOES matter, and here it is.

The long and the short of it is this…

- When you pre increment a variable, you are changing the value of a variable, and the new value will be used for whatever code the pre increment is part of.
- When you post increment a variable, you are changing the value of a variable, but the OLD value is used for whatever code the post increment is part of.

To make post increment work that way, it essentially needs to make a copy of the variable before the increment and return the copy for use by the code of which the post increment is a part of.

A pre increment has no such need, it modifies the value and everything uses the same variable. There is no copy needed.

The compiler / optimizer apparently does a good job of figuring out when it does or does not need to make a copy of integral types, but it doesn’t do as well with complex objects. Here’s some sample code to demonstrate this and the output that it generates in both debug and release.

#include #include #include //============================================================= class CScopeMessage { public: CScopeMessage(const char *label) { PrintIndent(); printf("%srn", label); s_indentLevel++; } CScopeMessage(const char *label, int objectId) { PrintIndent(); printf("%s obj %irn", label, objectId); s_indentLevel++; } CScopeMessage(const char *label, int objectId, int copyObjectId) { PrintIndent(); printf("%s (obj %i) to obj %irn", label, objectId, copyObjectId); s_indentLevel++; } ~CScopeMessage() { s_indentLevel--; } static void StartNewTest() { s_indentLevel = 0; s_lineNumber = 0; printf("rn"); } private: void PrintIndent() { s_lineNumber++; printf("%2i: ", s_lineNumber); for(int index = 0; index < s_indentLevel; ++index) printf(" "); } private: static int s_indentLevel; static int s_lineNumber; }; int CScopeMessage::s_indentLevel = 0; int CScopeMessage::s_lineNumber = 0; //============================================================= class CTestClass { public: CTestClass() { m_objectID = s_objectID; s_objectID++; // this is just noise in the test, but feel free to // comment out if you want to see for yourself //CScopeMessage msg("Constructing", m_objectID); m_value = new char[4]; strcpy(m_value, "one"); } CTestClass(const CTestClass &other) { m_objectID = s_objectID; s_objectID++; CScopeMessage msg("Copy Constructing", other.m_objectID, m_objectID); m_value = new char[strlen(other.m_value) + 1]; strcpy(m_value, other.m_value); } ~CTestClass() { CScopeMessage msg("Destroying", m_objectID); delete[] m_value; } // preincrement CTestClass &operator++() { CScopeMessage msg("Pre Increment", m_objectID); DoIncrement(); return *this; } // postincrement CTestClass operator++(int) { CScopeMessage msg("Post Increment", m_objectID); CTestClass result(*this); DoIncrement(); return result; } void DoIncrement() { CScopeMessage msg("Doing Increment", m_objectID); } private: char *m_value; int m_objectID; static int s_objectID; }; int CTestClass::s_objectID = 0; //============================================================= int main (int argc, char **argv) { CTestClass test; { CScopeMessage msg("--Post Increment--"); test++; } CScopeMessage::StartNewTest(); { CScopeMessage msg("--Post Increment Assign--"); CTestClass testB = test++; } CScopeMessage::StartNewTest(); { CScopeMessage msg("--Pre Increment--"); ++test; } CScopeMessage::StartNewTest(); { CScopeMessage msg("--Pre Increment Assign--"); CTestClass testB = ++test; } system("pause"); return 0; }

Here’s the debug output:

You can see that in the post increment operator, it calls the copy constructor not once but twice! The first copy constructor is called to create the “result” object, and the second copy constructor is called to return it by value to the caller.

CTestClass operator++(int) { CScopeMessage msg("Post Increment", m_objectID); CTestClass result(*this); DoIncrement(); return result; }

Note that it can’t return the copy by reference because it’s a local variable. C++11’s “std::move” and xvalue type functionality is there to help with this stuff, but if you can’t use that tech yet, it isn’t much help hehe.

Interestingly, we can see that 2 copy constructors get called whether or not we assign the value returned or not.

On the pre-increment side, you can see that it only does a copy construction call if you assign the result. This is nice and is what we want. We don’t want extra object copies or memory allocations and deallocations.

Things are a little bit better in release, but not by much. The optimizer seems to have figured out that it doesn’t really need to do 2 object copies, since it only ever wants at most one REAL copy of the object, so it gets away with doing one object copy in both situations instead of two.

That’s an improvement, but still not as good as the pre-increment case which hasn’t visibly changed in release (not sure about the assembly of these guys, but if you look and find something interesting, post a comment!)

As always, you should check your own assembled code, or test your compiler with printf output like this post does to ensure you really know what your code is doing.

But… it seems like you might want to use pre-increment if you ever use increment operators for heavy weight objects (such as iterators), but if you want to use post increment for integral types, it ought to be fine.

That said, a lot of people say “just use pre-increment” because at worst it’ll be the same as a post increment, but at best it will be a lot more efficient.

You do whatever you want, so long as you are aware of the implications of going one way or the other 😛

When I posted the last blog post about shuffling on the GameProgrammer.com mailing list, someone responded back with a super tiny random number generator that is actually pretty damn good. It is this:

x+=(x*x) | 5;

The high bit of X is the source of your random numbers, so if you want to generate an 8 bit random number, you have to call it 8 times. Apparently it passes a lot of tests for randomness really well and is a pretty high quality PRNG. Check this out for more info: http://www.woodmann.com/forum/showthread.php?3100-super-tiny-PRNG

You can start x at whatever you want, but it might take a few iterations to “warm up” especially if you start with a small seed (ie you may want to throw away the first 5-10 random bits it generates as they may not be that random). I adapted it into an example program below, along with some example output. I use time() to set the initial value of x.

#include #include #include #include // A super tiny prng // http://www.woodmann.com/forum/showthread.php?3100-super-tiny-PRNG // unsigned int seed = 0; unsigned int GenerateRandomBit() { seed += (seed * seed) | 5; return seed & 0x80000000; } template void GenerateRandom(T& value) { memset(&value, 0, sizeof(T)); const unsigned int numBits = sizeof(T) * 8; unsigned int* dataPointer = (unsigned int *)&value; for (unsigned int index = 0; index < numBits; ++index) { if(GenerateRandomBit()) { unsigned int pointerIndex = index / 32; unsigned int mask = 1 << index % 32; dataPointer[pointerIndex] |= mask; } } } int main(int argc, char **argv) { seed = (unsigned int)time(NULL); printf("seed = %urn", seed); printf("9 random uints...rn"); for (unsigned int index = 0; index < 9; ++index) { unsigned int random; GenerateRandom(random); printf("%2u: %10u (%x)rn", index, random, random); } printf("3 random floats...rn"); for (unsigned int index = 0; index < 3; ++index) { float f; GenerateRandom(f); printf("%2u: %f (%x)rn", index, f, *((unsigned int*)&f)); } printf("8 random characters...rn"); char text[8]; GenerateRandom(text); for (unsigned int index = 0; index < 8; ++index) { printf("%2u: %crn", index, text[index]); } system("pause"); return 0; }

In this post I’m going to show a way to make an iterator that will visit items in a list in a random order, only visit each item once, and tell you when it’s visited all items and is finished. It does this without storing a shuffled list, and it also doesn’t have to keep track of which items it has already visited.

This means you could have a list that is a million items long and no matter how many you have visited, it will only take two uint32s to store the current state of the iterator, and it will be very fast to get the next item in the list (somewhere around 1 to 4 times the cost of calling the rand() function!).

This is a follow up post to an older post called Fast & Lightweight Random “Shuffle” Functionality.

In that older post, things didn’t work quite like I expected them to so it was back to the drawing board for a little while to do some research and experimentation to find a better solution. I’m not going to go back over the basics I talked about in that article so go back and have a read there first if anything is confusing.

In the last post on this topic we talked about how the high level goal was to map the index to a random number, and because we were randomizing the bits in a deterministic (and non destructive) way, we needed to iterate over the whole “next power of 2” items and reject any that were too large. Only doing this could we be sure that we visited every index. The problem I hit last time though was that I could not get the numbers to look random enough.

To solve this, i decided what i needed to do was ENCRYPT the index with a block cipher. When you encrypt data, it should come out looking like random data, even though the data you put in may be sequential or have other easily seen patterns. What else is great, is that when using a block cipher, each unique input should equate to a unique output which means that if we encrypt the full power of 2 range as input, we will get the full power of 2 range out as output, but just in a different order.

Once I realized this was a good solution, my next problem to tackle was that I knew of no block algorithms that would work for a variable number of bits. There are block cipher algorithms that will work for LARGE number of bits, but there is no algorithm I knew of where you can tell it “hey, i only want to use 4 bits” or “i only want to use 5 bits”.

In the end the answer I found was to roll my own, but use existing, well established technology to do so. In my specific case, I’m also aiming for high speed functions since I want this functionality used in real time gaming applications.

What I came up with in the end is not cryptographically secure, but using the same techniques I have laid out, you should be able to drop in a different block cipher algorithm if that is a requirement.

As it turns out, there is a basic primitive of cryptography called a Feistel Network. It’s used by quite a number of modern ciphers and hash functions, and it is surprisingly simple.

For a balanced Feistel Network, you split the data into a left and a right side, and do the following, which consists of a single round (you can do as many rounds as you like):

Left[i+1] = Right[i]; Right[i+1] = Left[i] ^ RoundFunction(Right[i], key);

After performing however many rounds you wish, you combine the left and the right sides again to get your encrypted data.

To unencrypt, the feistel network works much the same but only in reverse, looking like the below:

Right[i] = Left[i+1]; Left[i] = Right[i+1] ^ RoundFunction(Left[i+1], key);

Check out the wikipedia page if you are interested in more info.

The neat thing about Feistel Networks is that the round function can be any deterministic function that performs whatever operations it wants – even destructive and irreversible operations such as division or bit shifts. Even so, the feistel network as a whole is reversible, no matter what you do in the round function, and you can unencrypt to get your origional data back.

This threw me for quite a loop and I couldn’t get my head around why this worked for a long while until I found a webpage that explained it pretty well. unfortunately I lost the link and cannot find it again but the basic idea is this… For each round of encryption, the right side is encrypted using the key and the left side. This means that at any point, no matter how many rounds you’ve done on your data, the right side should be able to be decrypted using the key and the left side. If you have the key, and you know how many rounds were used in encryption, you have all the data you need to decrypt it again. Hopefully that makes sense… I had to work it out on paper a little bit to see it fully.

The other great thing about Feistel Networks is that you can make them be however many bits you want. So, if i want each side of the Feistel Network to be 1 bit, I can do that. Or, if i want each side to be 128 bits, I can do that too!

You can also tune the quality / performance a bit by doing less or more rounds.

BTW the Tiny Encryption Algorithm uses a Feistel Network if you want to see a simple example in action.

With the “variable bit size support” problem solved, next I needed to come up with a round function that did a good job of taking sequential numbers as input and spitting out seemingly random numbers as output. Thanks to what I was saying before, the round function doesn’t need to be reversible, so there are a lot of options available.

I ended up deciding to go with a hash function, specifically Murmur Hash 2 (which I actually also used in my last post if you’d like to see some more information on it! The Incredible Time Traveling Random Number Generator).

Since the hash spits out numbers that might be anything in the range of an unsigned int, but I only want N bits, I just AND the hash against a mask to get the number of bits I want. There’s probably a higher quality method of smashing down the bits using XOR or something, but my quality needs aren’t very high so I just opted to AND it.

A downside of going with the balanced Feistel Network approach is that before this, I only had to round up to the next power of 2, but now, since each half of the data needs to be a power of 2, I actually have to make sure I have an even number of bits and have to round up to the next power of 4. This means that when it’s looking for valid indices to return in the shuffle, it may have to calculate up to 4 different indices on average before it finds a valid one. Not the greatest thing in the world, but also not the worst and definitely not a deal breaker in my eyes.

At long last, here is the code! Use it in good health (:

There are some example runs of the program below it as well.

#include #include #include // MurmurHash code was taken from https://sites.google.com/site/murmurhash/ //----------------------------------------------------------------------------- // MurmurHash2, by Austin Appleby // Note - This code makes a few assumptions about how your machine behaves - // 1. We can read a 4-byte value from any address without crashing // 2. sizeof(int) == 4 // And it has a few limitations - // 1. It will not work incrementally. // 2. It will not produce the same results on little-endian and big-endian // machines. unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed ) { // 'm' and 'r' are mixing constants generated offline. // They're not really 'magic', they just happen to work well. const unsigned int m = 0x5bd1e995; const int r = 24; // Initialize the hash to a 'random' value unsigned int h = seed ^ len; // Mix 4 bytes at a time into the hash const unsigned char * data = (const unsigned char *)key; while(len >= 4) { unsigned int k = *(unsigned int *)data; k *= m; k ^= k >> r; k *= m; h *= m; h ^= k; data += 4; len -= 4; } // Handle the last few bytes of the input array switch(len) { case 3: h ^= data[2] << 16; case 2: h ^= data[1] <> 13; h *= m; h ^= h >> 15; return h; } struct SShuffler { public: SShuffler(unsigned int numItems, unsigned int seed) { // initialize our state m_numItems = numItems; m_index = 0; m_seed = seed; // calculate next power of 4. Needed sice the balanced feistel network needs // an even number of bits to work with m_nextPow4 = 4; while (m_numItems > m_nextPow4) m_nextPow4 *= 4; // find out how many bits we need to store this power of 4 unsigned int numBits = 0; unsigned int mask = m_nextPow4 - 1; while(mask) { mask = mask >> 1; numBits++; } // calculate our left and right masks to split our indices for the feistel // network m_halfNumBits = numBits / 2; m_rightMask = (1 << m_halfNumBits) - 1; m_leftMask = m_rightMask << m_halfNumBits; } void Restart() { Restart(m_seed); } void Restart(unsigned int seed) { // store the seed we were given m_seed = seed; // reset our index m_index = 0; } // Get the next index in the shuffle. Returning false means the shuffle // is finished and you should call Restart() if you want to start a new one. bool Shuffle(unsigned int &shuffleIndex) { // m_index is the index to start searching for the next number at while (m_index < m_nextPow4) { // get the next number shuffleIndex = NextNumber(); // if we found a valid index, return success! if (shuffleIndex 1) { // get the last number shuffleIndex = LastNumber(); // if we found a valid index, return success! if (shuffleIndex > m_halfNumBits; unsigned int right = (index & m_rightMask); // do 4 feistel rounds for (int index = 0; index < 4; ++index) { unsigned int newLeft = right; unsigned int newRight = left ^ (MurmurHash2(&right, sizeof(right), m_seed) & m_rightMask); left = newLeft; right = newRight; } // put the left and right back together to form the encrypted index return (left << m_halfNumBits) | right; } private: // precalculated values unsigned int m_nextPow4; unsigned int m_halfNumBits; unsigned int m_leftMask; unsigned int m_rightMask; // member vars unsigned int m_index; unsigned int m_seed; unsigned int m_numItems; // m_index assumptions: // 1) m_index is where to start looking for next valid number // 2) m_index - 2 is where to start looking for last valid number }; // our songs that we are going to shuffle through const unsigned int g_numSongs = 10; const char *g_SongList[g_numSongs] = { " 1. Head Like a Hole", " 2. Terrible Lie", " 3. Down in It", " 4. Sanctified", " 5. Something I Can Never Have", " 6. Kinda I Want to", " 7. Sin", " 8. That's What I Get", " 9. The Only Time", "10. Ringfinger" }; int main(void) { // create and seed our shuffler. If two similar numbers are hashed they should give // very different results usually, so for a seed, we can hash the time in seconds, // even though that number should be really similar from run to run unsigned int currentTime = time(NULL); unsigned int seed = MurmurHash2(¤tTime, sizeof(currentTime), 0x1337beef); SShuffler shuffler(g_numSongs, seed); // shuffle play the songs printf("Listen to Pretty Hate Machine (seed = %u)rn", seed); unsigned int shuffleIndex = 0; while(shuffler.Shuffle(shuffleIndex)) printf("%srn",g_SongList[shuffleIndex]); system("pause"); return 0; }