This post has an overview of information entropy then moves onto some technical details and experiments.
If you have any comments, questions, important additions, etc, please hit me up by commenting on this post, or on twitter at https://twitter.com/Atrix256.
The code that goes with this post is at https://github.com/Atrix256/Entropy/
Thanks to all the people on twitter who answered questions & helped me learn this stuff. Links to threads and resources are at the bottom of the post!
Information Entropy Overview
Information entropy is a measure of how much information there is in some specific data. It isn’t the length of the data, but the actual amount of information it contains.
For example, one text file could contain “Apples are red.” and another text file could contain “Apples are red. Apples are red. Apples are red.”. The second text file is 3 times longer than the first but doesn’t really give any extra information. These two text files would have nearly the same amount of information entropy, despite them having different lengths.
It isn’t really possible to calculate the actual amount of information entropy a specific piece of data has in it because it’s related to Kolmogorov complexity: the length of the shortest program that can generate the data (https://en.wikipedia.org/wiki/Kolmogorov_complexity), which turns out to be an incalculable value. That’s why i said that those two text files would have nearly the same amount of information entropy instead of saying they had the same amount. A program to generate the second text file is probably going to be very slightly more complex than the text file to generate the first one: it’ll have a for loop on the outside!
You can get a pretty good upper bound on information entropy though by making a histogram of how often each symbol appears, treating that histogram like a weighted N sided dice (a probability mass function or PMF), and calculating the entropy of that dice.
Interestingly, Huffman compression is related to this (https://en.wikipedia.org/wiki/Huffman_coding). In Huffman compression, you take a histogram of the data, and give shorter length bit patterns to the symbols that occur more. This makes the overall data smaller, but doesn’t change the amount of information there is in the data. In fact, it’s part of a family of compressors called “Entropy encoders” (https://en.wikipedia.org/wiki/Entropy_encoding).
Because of this, you can get a pretty good idea of how much information is in some data by how much you can compress it. If it shrinks a lot, the information density was low. If it doesn’t shrink much, or even gets larger (which happens when the compressed data plus the compression header is larger than the original data!), then the information density is pretty high.
However, there are counter examples to that. Encrypting data hides the patterns of the data, making it look more like uniformly distributed white noise (independent random numbers) while keeping the data the same size. Encrypting data doesn’t actually change the amount of entropy in the data, but it changes the APPARENT amount of entropy. It makes the “shortest program that can generate the data” harder to find, and in fact, encrypted data will not compress!
This actually feels opposite to some things i’ve read about machine learning type algorithms. In machine learning, finding the solution to a problem involves finding the global minimum, and you want a global minimum to be wide and flat so that it’s easy to find and hard to fall out of. There are techniques to make this be more true, such as the SVM kernel trick, or neural networks which make the problem space be higher dimensional and be saddle shaped to help not get stuck in local minima. Encryption seems to do the opposite, making the minimums (global OR local!) of the Kolmogorov complexity be very hard to find, like a dirac delta, where there is no hint where they are so that you have to know where they are in advance (aka you have to know the encryption key!) to be able to find the minimum.
Before moving onto details and experiments, here is a really cool read “Entropy and Redundancy in English” which talks about how Claude Shannon calculated letters in the english language to have about 2.62 bits of information per letter.
Interestingly, that last link also talks about something I wrote up previously, which is using markov chains to generate text.
There is also a really interesting tweet from Donald Mitchell (who is famous for research which includes Mitchell’s best candidate algorithm for generating blue noise sample points!), which talks about encryption and compression and generating valid looking language. It’s a very cool tangle of ideas I’m still sorting out in my head 🙂
That “transforms English sentences to other English sentences” part also sounds a lot like format preserving encryption.
To calculate information entropy, you need to calculate the entropy for each possible event or symbol and then sum them all up.
To calculate the entropy of a specific event X with probability P(X) you calculate this:
As an example, let’s calculate the entropy of a fair coin.
The probability of heads is 50%. Here’s the entropy we get when plugging that 0.5 into the equation:
Since tails has the same probability it has the same entropy as heads (which is 0.5) so we add the entropy of heads and tails to get 1 bit of entropy for a fair coin flip. You can alternately say it’s 1 Shannon of entropy, and if you use different log bases, there are other unit of measurements. The important take away though is that a fair coin gives 1 bit of entropy.
What if the coin isn’t fair though? Here’s the entropy if there’s a 75% chance of heads and a 25% chance of tails.
So, in this case, each flip of this unfair coin only gives you 0.811 bits of information. It kind of makes sense because each flip you are pretty sure it’s going to be heads, so when it is heads, it doesn’t give you a whole lot of information.
What if the coin somehow is incapable of landing tails. Maybe it is just a 2 headed coin.
Flipping this maximally unfair coin gives us zero bits of information which makes sense. We know that it will always land heads so the coin flip gives us absolutely no new information.
This equation works the same way no matter how many events or what their probabilities are. It works with 6 sided dice. It works with byte values in a text file, it also works with bits in a text file (going back to heads or tails).
When you have N discrete symbols or events, and a probability of encountering each, it’s called a probability mass function or PMF, which again, can be made from a histogram.
Interestingly, entropy calculations have a full suite of calculations similar to probability calculations. There is joint entropy, conditional entropy, the chain rule of entropy and even a Bayes rule of entropy!
The C++ code and source data for these experiments can be found on github at https://github.com/Atrix256/Entropy
We are going to start with a 26KB text file of “The Last Question” by Isaac Asimov. (from https://hell.pl//szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm)
We can make a histogram of how many times each 8 bit letter occurred, then turn that into a PMF by calculating the percentage occurrence of each character. From there we can calculate the entropy and divide by 8 to get the entropy per bit. Doing that we get the value 0.582. So, there are 0.582 bits of information entropy per bit of data in that text file. (Technically: or less)
If we compress it with the standard zip file compressor in windows, making an 11KB zip file, then do the same to that file, we get a value of 0.962 bits of information per bit of data in that text file. The file shrank to 42.3% of the size (the old file is 2.36 times larger) and we got 1.65 times as much information entropy.
If instead of compressing the text file, we encrypt it with openssl (command: openssl enc -aes-256-cbc -salt -in lastquestion.txt -out lastquestion.enc -pass pass:moreentropyplease), then calculate the entropy of that file, we get a value of 0.926 bits of information entropy per bit of data.
It’s weird to me that the encrypted data reports a lower entropy than the compressed data, and i’d expect the opposite. The issue might be that our calculation of entropy is naive, and that a more robust entropy calculation would show the encrypted data to have more entropy.
An example of being less naive would like making a histogram of 4 bit, 6 bit, 12 bit values etc instead of only 8. Another thing to do would be to try calculating entropy of re-arrangements of the data. Ultimately, you would report the lowest entropy seen from different “views” of the data.
I’m not sure if that’s what’s at play in the entropy between encrypted and compressed data though.
Interestingly, if you try to compress the encrypted data, it doesn’t really compress and stays at 26KB. The encryption has hidden the structure of the source data pretty well! It now reports an information entropy per bit of 0.923 which is very slightly down from 0.926 bits. It might be the added useless compression header causing the information entropy to go down.
What if we take the zip file of the plain text file and base 64 encode it? We are taking the same data and encoding it in fewer 8 bit symbols. That makes the file larger without changing the amount of information in the data. It’s like the reverse of compression so we’d expect the amount of information entropy per bit to go down and that is what happens! The file increases from 11KB to 14KB and we get an information entropy per bit value of 0.749, which is down from 0.962.
What if we look at the entropy of uniform white noise (plain vanilla random numbers)? Taking 100,000 white noise bytes, building a histogram of each 8 bit value, making a PMF and calculating entropy, we get 0.999733. That makes sense because white noise gives maximal entropy so we’d expect a value of 1 and got something very close.
In a recent post i talked about using dice to generate different colors and distributions of noises (https://blog.demofox.org/2019/07/30/dice-distributions-noise-colors/). Do different distributions of noise have different amounts of entropy?
First, let’s look at a triangle distribution vs a uniform distribution.
Rolling a 6 sided die gives us a uniform distribution with possible values 1 through 6. The entropy of a 6 sided die is 2.585 bits.
If we wanted a triangle distribution with possible values 1 through 6, we could roll a 4 sided die and a 3 sided die (strangely, they exist! https://www.amazon.com/Bescon-Polyhedral-3-sided-Gaming-Assorted/dp/B06XHF4Y6J) and add them to get values 2 through 7, then we could subtract 1 to get values 1 through 6.
We can make a histogram by looking at how many ways each possible number can come up, and turn them into percentages:
- one= 1 = 8%
- two = 2 = 17%
- three = 3 = 25%
- four = 3 = 25%
- five = 2 = 17%
- six = 1 = 8%
Summing the entropy of each of those percentages gives about 2.454 vs a uniform 6 sided die giving 2.585 bits. So, triangular distributed values have less entropy than uniform distributed values. That makes a bit of sense because we already knew that a fair coin gave more entropy than an unfair coin, and a triangle distribution is like an extension of an unfair coin to dice.
Next, let’s look at the color of noise (frequency composition / correlation). Let’s use blue noise because that’s my favorite.
I made some floating point 1d blue noise data (with values between 0 and 1) using Mitchell’s best candidate algorithm (https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/), and converted the floating point values to uint8.
If you look at a histogram of those uint8 values, it will show a flat histogram, meaning a uniform distribution and will give maximal entropy.
That isn’t quite right though, because we know that the values have randomized high frequency components and no low frequency components.
That means that each roll is based on the previous rolls, and so the rolls aren’t uncorrelated, and they aren’t independent. That means they should have less entropy.
One idea to capture this relationship is to take each pair of uint8s in the data, make them into a uint16, and make a histogram of those uint16s (Thanks Nathan! https://twitter.com/Reedbeta). If the uint8s were uncorrelated and evenly distributed, then the uint16s should be evenly distributed. That would mean that a value didn’t care about what it’s previous value was.
Since we know there is correlation in the blue noise, the result in our case shouldn’t be evenly distributed, and should result in less entropy than white noise.
That’s the result we get in fact, and so for 100,000 uint8 blue noise values, we get an entropy of 0.92608, where white noise gives 0.965912.
So interestingly, blue noise is more predictable and less random than white noise, which shouldn’t come as a huge surprise, since we know the PMF isn’t a uniform distribution at each new value.
Making a histogram of 16 bit values of the 8 bit stream like the above is equivalent to making an order 1 Markov chain and only has a “memory” of the previous value. You could make higher order Markov chains by making 24 or 32 bit values and beyond, to find correlation that went beyond immediate neighbors, but keeping a histogram of 2^24 or 2^32 values and larger is a pretty large amount of memory. You’d also need a lot more source data, to give the histogram a chance to stabilize / converge.
There might also be a delay of correlation, like each value could be correlated with the value that is 100 values ago. At one byte per value, you’d need an order 100 Markov chain to capture that relationship, and so would need a 2^800 bytes of memory which is astronomically large.
If you look at the csv in the output folder of the repository for this post (https://github.com/Atrix256/Entropy), you’ll see some oddities.
One is how when looking at 11 bits at a time, the plain text has more entropy than when looking at it with 8 or 12 bits. The reason for that specific thing is that the data inherently is 8 bit, which is coprime to the 11 bits we are looking at it with, so makes the entropy look a lot higher by making it seem like there are more unique bit patterns than there are. Ultimately, when calculating entropy in a variety of ways, you should take the minimum value you find as the final answer, because it’s related to the shortest program you can reproduce it with. There are plenty of longer programs, but you only care about the shortest. You should also be careful not to get too crazy in whatever code you use to make your “view” of the data (re-arranging values / looking forward & backward a far amount) because that stuff is part of the program needed to reproduce the data, but isn’t represented in the entropy calculation you are doing.
Something else is that if you don’t have enough data, the histogram / PMF won’t converge to the actual values. You can see this manifesting in the csv data in the difference between the “small white noise” which is 8 bytes of white noise, and the “white noise” which is 100,000 bytes of white noise. They really do have the same characteristics and entropy amount per bit, but the small white noise doesn’t have enough data available to show the correct values.
A very related, but more formalized, write up on information theory.
A stack exchange question about how to calculate entropy of correlated samples.
Here’s a neat post that shows how to visualize entropy in a binary file, to find where encryption keys are since they are high entropy. Thanks Ashley! (https://twitter.com/ACEfanatic02)
The “Breaking Math” podcast episode 24 is really good and related to this stuff:
“The Complexity Barrier” blog post by John Baez
A thread showing how Oskar Stålberg is bitbacking data very efficiently, so the data is not compressing very well at all.
Some related twitter megathreads:
The difference between uncorrelated and independent: