Demystifying Floating Point Precision

Floating point numbers have limited precision. If you are a game programmer, you have likely encountered bugs where things start breaking after too much time has elapsed, or after something has moved too far from the origin.

This post aims to show you how to answer the questions:

  1. What precision do I have at a number?
  2. When will I hit precision issues?

First, a very quick look at the floating point format.

Floating Point Format

Floating point numbers (Wikipedia: IEEE 754) have three components:

  1. Sign bit – whether the number is positive or negative
  2. Exponent bits – the magnitude of the number
  3. Mantissa bits – the fractional bits

32 bit floats use 1 bit for sign, 8 bits for exponent and 23 bits for mantissa. Whatever number is encoded in the exponent bits, you subtract 127 to get the actual exponent, meaning the exponent can be from -126 to +127.

64 bit doubles use 1 bit for sign, 11 bits for exponent and 52 bits for mantissa. Whatever number is encoded in the exponent bits, you subtract 1023 to get the actual exponent, meaning the exponent can be from -1022 to +1023.

16 bit half floats use 1 bit for sign, 5 bits for exponent and 10 bits for mantissa. Whatever number is encoded in the exponent bits, you subtract 15 to get the actual exponent, meaning the exponent can be from -14 to +15.

For all of the above, an exponent field of all zeros is reserved for zero and the denormals / subnormals, and an exponent field of all ones is reserved for infinity and NaN. That is why the usable exponent range is two values short of what the exponent bits could otherwise encode.

The exponent bits tell you which power of two numbers you are between – [2^{exponent}, 2^{exponent+1}) – and the mantissa tells you where you are in that range.
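To make the three fields concrete, here is a minimal sketch that pulls them out of a 32 bit float using Python's `struct` module. The helper name `decode_float32` is made up for this illustration, and it only handles normal numbers (not zero, subnormals, infinity or NaN).

```python
import struct

def decode_float32(x):
    """Split x (rounded to float32) into (sign, exponent, mantissa).

    Hypothetical helper for illustration; valid for normal numbers only.
    """
    # Reinterpret the 4 bytes of the float32 as an unsigned 32 bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                    # 1 sign bit
    raw_exponent = (bits >> 23) & 0xFF   # 8 exponent bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF           # 23 mantissa bits
    return sign, raw_exponent - 127, mantissa

sign, exponent, mantissa = decode_float32(3.5)
print(sign, exponent, mantissa)   # 0 1 6291456
# Exponent 1 means the range [2, 4); the mantissa says how far into it we are.
print(mantissa / 2**23)           # 0.75, and 2 + 0.75 * (4 - 2) = 3.5
```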

What precision do I have at a number?

Let’s look at the number 3.5.

To figure out the precision we have at that number, we figure out what power of two range it’s between and then subdivide that range using the mantissa bits.

3.5 is between 2 and 4. That means we are dividing the range of numbers 2 to 4 using the mantissa bits. A float has 23 bits of mantissa, so the precision we have at 3.5 is:

\frac{4-2}{2^{23}} = \frac{2}{8388608} \approx 0.000000238418579

3.5 itself is exactly representable by a float, double or half, but that value is the precision available at that scale. It is the smallest amount you can add to or subtract from a value between 2 and 4, so it is the resolution you are working with when using a float between 2 and 4.

Using a double instead of a float gives us 52 bits of mantissa, making the precision:

\frac{4-2}{2^{52}} = \frac{2}{4503599627370496} \approx 0.00000000000000044408921

Using a half float with 10 bits of mantissa it becomes:

\frac{4-2}{2^{10}} = \frac{2}{1024} = 0.001953125
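The same calculation can be sketched in a few lines of Python. The function name `precision_at` is an assumption of this example, and it ignores subnormals and each type's exponent limits, so treat it as a back-of-the-envelope tool rather than a full IEEE 754 model.

```python
import math

def precision_at(value, mantissa_bits):
    """Step size between representable numbers near `value`.

    Hypothetical helper: ignores subnormals and exponent range limits.
    """
    # frexp gives value = m * 2**e with 0.5 <= m < 1,
    # so value lies in the range [2**(e-1), 2**e), which is 2**(e-1) wide.
    _, e = math.frexp(abs(value))
    range_width = math.ldexp(1.0, e - 1)
    return range_width / 2 ** mantissa_bits

for name, bits in (("half", 10), ("float", 23), ("double", 52)):
    print(name, precision_at(3.5, bits))
```

Because every quantity involved is a power of two, the results are exact, so they can be compared directly against the fractions above.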

Here’s a table showing the amount of precision you get with each data type at various exponent values. N/A is used when an exponent is out of range for the specific data type.

\begin{array}{c|c|c|c|c} exponent & range & half & float & double \\ \hline 0 & [1,2) & 0.0009765625 & 0.00000011920929 & 0.0000000000000002220446 \\ 1 & [2,4) & 0.001953125 & 0.000000238418579 & 0.00000000000000044408921 \\ 2 & [4,8) & 0.00390625 & 0.000000476837158 & 0.00000000000000088817842 \\ 9 & [512, 1024) & 0.5 & 0.00006103515 & 0.00000000000011368684 \\ 10 & [1024,2048) & 1 & 0.00012207031 & 0.00000000000022737368 \\ 11 & [2048,4096) & 2 & 0.00024414062 & 0.00000000000045474735 \\ 12 & [4096,8192) & 4 & 0.00048828125 & 0.0000000000009094947 \\ 15 & [32768, 65536) & 32 & 0.00390625 & 0.0000000000072759576 \\ 16 & [65536, 131072) & N/A & 0.0078125 & 0.000000000014551915 \\ 17 & [131072, 262144) & N/A & 0.015625 & 0.00000000002910383 \\ 18 & [262144, 524288) & N/A & 0.03125 & 0.000000000058207661 \\ 19 & [524288, 1048576) & N/A & 0.0625 & 0.00000000011641532 \\ 23 & [8388608,16777216) & N/A & 1 & 0.00000000186264515 \\ 52 & [4503599627370496, 9007199254740992) & N/A & 536870912 & 1 \\ \end{array}

A quick note on the maximum number you can store in floating point numbers, by looking at the half float specifically:

A half float has a maximum exponent of 15, which you can see above puts the number range between 32768 and 65536. The precision is 32 which is the smallest step that can be made in a half float at that scale. That range includes the smaller number but not the larger number. That means that the largest number a half float can store is one step away (32) from the high side of that range. So, the largest number that can be stored is 65536 – 32 = 65504.
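The same "one step below the top of the range" reasoning works for the other types too. Here is a small sketch, with `largest_finite` being a name invented for this example. It is written as (2 - 2^-MantissaBits) * 2^MaxExponent rather than as a subtraction from 2^(MaxExponent+1), because that top value would overflow a double.

```python
import struct
import sys

def largest_finite(max_exponent, mantissa_bits):
    """Largest finite value: one step below 2**(max_exponent + 1).

    Hypothetical helper for illustration.
    """
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent

print(largest_finite(15, 10))      # half:  65504.0
print(largest_finite(127, 23))     # float: roughly 3.4e38
print(largest_finite(1023, 52))    # double

# Cross-check the half float against its bit pattern:
# 0x7BFF is sign 0, exponent field 11110, mantissa all ones.
print(struct.unpack("<e", struct.pack("<H", 0x7BFF))[0])
```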

How Many Digits Can I Rely On?

Another helpful way of looking at floating point precision is how many digits of precision you can rely on.

A float has 23 bits of mantissa, and 2^23 is 8,388,608. 23 bits let you store all 6 digit numbers or lower, and most of the 7 digit numbers. This means that floats have between 6 and 7 digits of precision, regardless of exponent.

That means that from 0 to 1, you have quite a few decimal places to work with. If you go into the hundreds or thousands, you’ve lost a few. When you get up into the tens of millions, you’ve run out of digits for anything beyond the decimal place.

You can see that this is true in the table in the last section. With floats, it’s at exponent 23 (8,388,608 to 16,777,216) that the precision is at 1. The smallest value that you can add to a float in that range is in fact 1. It’s at this point that you have lost all precision to the right of the decimal place. Interestingly, you still have perfect precision of the integers though.

Half floats have 10 mantissa bits and 2^10 = 1024, so they just barely have 3 digits of precision.

Doubles have 52 mantissa bits and 2^52 = 4,503,599,627,370,496. That means doubles have between 15 and 16 digits of precision.
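One way to put a number on those digit counts is to convert bits to decimal digits: each bit is worth log10(2), or about 0.301 digits. This sketch counts only the stored mantissa bits, as the text above does, and the function name `decimal_digits` is made up for the example.

```python
import math

def decimal_digits(mantissa_bits):
    """Approximate decimal digits carried by `mantissa_bits` bits.

    Hypothetical helper; counts stored mantissa bits only.
    """
    return mantissa_bits * math.log10(2)

for name, bits in (("half", 10), ("float", 23), ("double", 52)):
    print(f"{name}: about {decimal_digits(bits):.2f} digits")
```

The results land at roughly 3.0, 6.9 and 15.7 digits, which lines up with "just barely 3", "between 6 and 7" and "between 15 and 16".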

This can help you understand how precision will break down for you when using a specific data type for a specific magnitude of numbers.

When will I hit precision issues?

Besides the loose rules above about how many digits of precision you can count on, you can also solve to see when precision will break down for you.

Let’s say that you are tracking how long your game has been running (in seconds), and you do so by adding your frame delta (in seconds) to a variable every frame.

If you have a 30fps game, your frame delta is going to be about 0.0333 (1/30 of a second).

Adding that each frame to a float will eventually cause the float to reach a value where that number is smaller than the smallest difference representable (smaller than the precision), at which point things will start breaking. At first your accuracy will drop and your time will be wrong, but eventually adding your frame delta to the current time won’t even change the value of the current time. Time will effectively stop!

When will this happen though?

We’ll start with the formula we saw earlier and do one step of simple algebra to get us an equation which can give us this answer.

\frac{range}{mantissa} = precision \\ \\ range = mantissa * precision

To use this formula, put the precision you want into “precision” and the number of values the mantissa can encode (2^{MantissaBits}) into “mantissa”. The result is the range at which you get that precision.

Let’s plug in our numbers:

range = 8388608 * 0.0333 = 279340.6464

This tells us the range of the floating point numbers where we’ll have our problems, but this isn’t the value that we’ll have our problems at, so we have another step to do. We need to find what exponent has this range.

Looking at the table earlier in the post you might notice that the range at an exponent also happens to be just 2^{exponent}.

That’s helpful because that just means we take log2 of the answer we got:

log2(279340.6464) = 18.0916659875

Looking at the table again, we can see that floats have a precision of 0.03125 at exponent value 18. So, exponent 18 is close, but its precision is smaller than what we want, which means the precision is still ok.

That means we need to ceil() the number we got from the log2.

Doing that, we see that things break down at exponent 19, which has precision of 0.0625. The actual value it has this problem at is 524,288 (which is 2^{19}).

So, our final formula for “where does precision become this value?” becomes:

value = pow(2, ceil(log2(mantissa * precision)))
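As a sanity check, the formula translates almost directly into Python. The function name `breakdown_value` is an assumption of this sketch, and it uses the same rounded frame delta of 0.0333.

```python
import math

def breakdown_value(mantissa_bits, precision):
    """Value at which the step size first reaches or exceeds `precision`.

    Hypothetical helper that mirrors the formula in the text.
    """
    mantissa = 2 ** mantissa_bits
    return 2.0 ** math.ceil(math.log2(mantissa * precision))

frame_delta = 0.0333  # seconds per frame at 30fps, rounded as in the text

for name, bits in (("half", 10), ("float", 23), ("double", 52)):
    seconds = breakdown_value(bits, frame_delta)
    print(f"{name}: {seconds:,.0f} seconds ({seconds / 86400:,.2f} days)")
```

For a float this gives 524,288 seconds, which is 2^19.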

Note that at exponent 18, there is still imprecision happening. When adding 1/30 to 262144, it goes from 262144 to 262144.031 to 262144.063, instead of 262144, 262144.033, 262144.067. There is error, but it’s fairly small.

At exponent 19 though, things fall apart a lot more noticeably. When adding 1/30 to 524288, it goes from 524288 to 524288.063 to 524288.125. Time is actually moving almost twice as fast in this case!

At exponent 20, we start at 1048576.00 and adding 1/30 doesn’t even change the value. Time is now stopped.
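You can reproduce this behavior without a game loop. Python's own floats are doubles, so this sketch emulates 32 bit float addition by rounding through `struct` after every operation. The helper names (`to_float32`, `add_float32`, `tick`) are made up for the illustration, and the clock starts exactly at 2^18, 2^19 and 2^20.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest 32 bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add_float32(a, b):
    """Emulate a float32 add: round both inputs, add, round the result."""
    return to_float32(to_float32(a) + to_float32(b))

def tick(start, delta, frames):
    """Clock value after each of `frames` float32 additions of `delta`."""
    values = []
    t = to_float32(start)
    for _ in range(frames):
        t = add_float32(t, delta)
        values.append(t)
    return values

delta = 1.0 / 30.0
for exponent in (18, 19, 20):
    print(exponent, tick(2.0 ** exponent, delta, 2))
# 18 [262144.03125, 262144.0625]   small error per frame
# 19 [524288.0625, 524288.125]     time runs almost twice as fast
# 20 [1048576.0, 1048576.0]        time has stopped
```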

It does take 6.1 days (524,288 seconds) to reach exponent 19 though, so that’s quite a long time.

If we use half floats, it falls apart at value 64. That’s right, it only takes 64 seconds for this to fall apart when using 16 bit half floats, compared to 6.1 days when using 32 bit floats!

With doubles, it falls apart at value 281,474,976,710,656. That is 8,925,512 years!

Let’s check out that equation again:

value = pow(2, ceil(log2(mantissa * precision)))

A possibly more programmer friendly way to do the above would be to calculate mantissa * precision and then round up to the next power of 2. That’s exactly what the formula is doing above, but in math terms, not programming terms.
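Here is one possible sketch of that "round up to the next power of 2" version. It uses `math.frexp` to read the exponent directly instead of calling log2, and both function names are invented for this example.

```python
import math

def next_power_of_two(x):
    """Smallest power of two that is >= x, for finite x > 0.

    Hypothetical helper for illustration.
    """
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= m < 1
    # m == 0.5 means x is already an exact power of two.
    return float(x) if m == 0.5 else math.ldexp(1.0, e)

def breakdown_value_pow2(mantissa_bits, precision):
    """Same result as pow(2, ceil(log2(mantissa * precision)))."""
    return next_power_of_two(2 ** mantissa_bits * precision)

print(breakdown_value_pow2(23, 0.0333))   # 524288.0
print(breakdown_value_pow2(10, 0.0333))   # 64.0
```

Since it works on floating point values rather than integer bit tricks, it also handles products below 1, which come up when the precision you care about is very small.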

Storing Integers

I recently learned that floating point numbers can store integers surprisingly well. It blows my mind that I never knew this. Maybe you are in the same boat 😛

Here’s the setup:

  1. For any exponent, the range of numbers it represents is a power of 2.
  2. The mantissa will always divide that range into a power of 2 different values.

It might take some time and/or brain power to soak that up (it did for me!) but what that ends up ultimately meaning is that floating point numbers can exactly represent a large number of integers.

In fact, a floating point number can EXACTLY store all integers from -2^{MantissaBits+1} to +2^{MantissaBits+1}.

For half floats that means you can store all integers from -2048 to +2048, inclusive. (\pm 2^{11})

For floats, it’s -16,777,216 to +16,777,216. (\pm 2^{24})

For doubles it’s -9,007,199,254,740,992 to +9,007,199,254,740,992. (\pm 2^{53})

Doubles can in fact exactly represent any 32 bit unsigned integer, since 2^32 = 4,294,967,296 is far below 2^53.
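A quick way to convince yourself of these limits is to round-trip integers through each format and see where they stop surviving. This sketch uses `struct`, whose `e`, `f` and `d` format codes are half, float and double. The helper name `round_trip` is an assumption of the example.

```python
import struct

def round_trip(value, fmt):
    """Store `value` in the given struct float format and read it back.

    Hypothetical helper for illustration.
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

for name, fmt, bits in (("half", "<e", 10), ("float", "<f", 23), ("double", "<d", 52)):
    limit = 2 ** (bits + 1)
    survives_limit = round_trip(limit, fmt) == limit
    survives_next = round_trip(limit + 1, fmt) == limit + 1
    print(name, limit, survives_limit, survives_next)
# half 2048 True False
# float 16777216 True False
# double 9007199254740992 True False
```

The first integer past the limit gets rounded to a neighbor, which is why the last column is False for all three types.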

Links

Here are some links you might find interesting!

Floating point visually explained:
http://fabiensanglard.net/floating_point_visually_explained/

What Every Computer Scientist Should Know About Floating-Point Arithmetic:
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

A matter of precision:
http://tomforsyth1000.github.io/blog.wiki.html#[[A%20matter%20of%20precision]]

Denormal numbers – aka very small numbers that make computations slow when you use them:
https://en.m.wikipedia.org/wiki/Denormal_number

Catastrophic Cancellation – a problem you can run into when doing floating point math:
https://en.wikipedia.org/wiki/Loss_of_significance

A handy web page that lets you play with the binary representation of a float and what number it comes out as:
https://www.h-schmidt.net/FloatConverter/IEEE754.html

Half precision floating point format:
https://en.wikipedia.org/wiki/Half-precision_floating-point_format

What is the first integer that a float is incapable of representing?
https://stackoverflow.com/questions/3793838/which-is-the-first-integer-that-an-ieee-754-float-is-incapable-of-representing-e

Ready to go deeper? Bruce Dawson has some amazing write ups on deeper floating point issues:
https://randomascii.wordpress.com/category/floating-point/

This talks about how to use floating point precision limits as an activation function in a neural network (?!)
https://blog.openai.com/nonlinear-computation-in-linear-networks/


8 comments

  2. Good explanations and the point about the progression of time error is very important, as it feeds into many calculations.

    Some comments:

    1. It may help to separate precision from accuracy, where accuracy refers to the resolution, or distance between one representable number and the next. For single precision floats, the precision is always 32bit.

    3. The accuracy, on the other hand, will vary with size of number represented long before you reach 6 digits.
    The resolution at 1.0 (between 0.5 and 1.0) is double that at 2.0 because at 2.0, there is a 1 in the exponent, so the gap between numbers is double than that at 1.0.

    4. For 3-dimensional space, the largest (worst case) gap error between representable points is a geometric formula (involving square root) and this is larger (about 3.4 times) than the values calculated for 1 dimension. To this one must include the temporal error as well, so one could well expect a larger multiple than 3.4.
