Bezier Triangles

There’s an interactive Bezier Triangles webGL demo I made that goes with this post, and what i used to make the 3d rendered images with. You can find here:

Above is a cubic bezier triangle which has 10 control points. Each edge of the triangle is a cubic Bezier. That makes for 9 control points, and the 10th control point is in the center of the triangle.

I’ve been thinking more about using raytracing for data lookups ( and have been trying to think of how you might get higher order interpolation than linear barycentric interpolation. The hope is to fit the data better, and with fewer data points, resulting in a sparser data mesh.

The higher order interpolation may take a little more compute and memory access, but having fewer triangles to raytrace against is a good thing (smaller BVH) besides the potentially better data fit. There’s a trade off here, I’m sure – and I’m sure that trade off is situational.

In the past, I found a way to exploit the linear interpolation of texture samplers to get higher order interpolation ( but when thinking about how to extend that to triangles in a vertex shader, I never found a good way to do it.

One of the challenges in the vertex shader triangle case is not easily being able to get barycentric coordinates on the triangle, but that’s drop dead simple in the raytracing case, since barycentric coordinates are one of the few pieces of data you actually have out of the box.

These ideas came together and I realized it was time to finally dig into the guts of Bezier triangles.

Luckily, it turns out they are not that scary!

Here is some info on Bezier curves that are good background for the rest of this post:

Bezier Simplex and Barycentric Interpolation

Bezier triangles are very close conceptually to Bezier curves, and in fact, both of them are examples of a Bezier Simplex.

A simplex is the shape with the fewest number of points to create an object in a given dimension.

A 0 dimensional simplex is a point, a 1 dimensional simplex is a line, a 2 dimension simplex is a triangle, a 3 dimensional simplex is a tetrahedron, and so on.

Barycentric coordinates are coordinates on a simplex, and sum up to 1.0.

Bezier curves are parameterized by a value t, like in the linear Bezier curve formula below (which is also just a lerp):


Another way to think of this though is that there are two parameters s and t, where they sum up to 1. That makes s=1-t. That lets us write the same equation like the below:


That is 1 dimensional linear barycentric interpolation between the values A and B on a line.

We can bring that up to 2 dimensions to get a linear barycentric interpolation between 3 points on a triangle.


That extends to any dimension by just adding more terms.

Much like we can bring this concept to higher order interpolation on a line with Bezier curves, we can do higher order Barycentric interpolation on simplices of higher dimensions. If you do this with a triangle, you get a Bezier triangle. If you do this with a tetrahedron, you get a Bezier tetrahedron, and so on.

Calculating Barycentric Coordinates

Calculating the barycentric coordinates is conceptually pretty simple.

  1. Start with an N dimensional simplex.
  2. When you place a point in it, that divides the simplex into several simplices of the same dimension. Each of these other simplices will be opposite a point on the larger simplex.
  3. Calculate the area of a smaller simplex and divide it by the area of the large simplex. This is the barycentric coordinate of the point that is opposite this simplex.

That explanation might have been a little thick. Lets look at lines and triangles as an example.

First is lines:

The point A divides the line (1-simplex) into two parts: P10 to A, and A to P01. P10 to A is 1 unit long. A to P01 is 4 units long. The entire line’s length is 5 units long.

To calculate the barycentric coordinates (s,t) we first start with s.

The simplex opposite P10 is the A to P01 line segment, which has a length of 4. Dividing that by 5, we get a value of 0.8. So, the value of s is 0.8.

The value of t can be found the same way. The opposite line segment of P01 is P10 to A and has a length of 1, which we can divide by 5 to get 0.2. We could have also just calculated 1 – 0.8 since they have to add up to 1.0!

The Barycentric coordinates of point A on that line is (0.8, 0.2).

Calculating the Barycentric coordinates of a triangle is very similar. In the image below, point A has Barycentric coordinates of (0.5, 0.333, 0.166)

A point is known to be inside a simplex if all the components of it’s Barycentric coordinates are between 0 and 1.

The De Casteljau Algorithm For Bezier Triangles

Starting with linear Bezier triangles, you have a control point at each corner of the triangle, and you use barycentric interpolation of those values to get the value at a point on the triangle.

This is a linear interpolation (the result is on a plane), and is what we’ve been doing in vertex shaders since vertex shaders existed. Simple barycentric interpolation.

Much like how a linear Bezier curve is a linear interpolation (lerp) between two values, a linear Bezier triangle is a linear interpolation between three values. Each edge of the triangle is also a linear Bezier curve.

Next up is a quadratic Bezier triangle.

A quadratic Bezier curve is made by linearly interpolating between two linear Bezier curves. A quadratic Bezier triangle is made by linearly interpolating between three linear Bezier triangles.

Here is a refresher on how a quadratic Bezier curve actually works. There are 3 control points A, B, C. One linear interpolation is from A to B. The other linear interpolation is from B to C. These values are linearly interpolated for the final point on the curve.

How does this work on a Bezier triangle? Each edge of the triangle in a quadratic Bezier curve which gives us control points like the below:

What we do first is use point P’s Barycentric coordinates to interpolate the values of the red triangle: 002, 011, 101 to get point A. You treat those values as if they are at the corners of the larger triangle (if you think about it, this is how quadratic Bezier curves work too with the time parameter. Each line segment has t=0 at the start and t=1 at the end).

We next interpolate the values of the green triangle: 011, 020, 110 to get point B.

We then interpolate the values of the blue triangle: 101, 110, 200 to get the point C.

Now we can do a regular linear barycentric interpolation between those three values, using the same barycentric coordinates, and get a quadratic Bezier triangle. Each edge of the triangle is also a quadratic Bezier Curve.

For a cubic Bezier triangle, it’s much the same. You evaluate a quadratic Bezier for each corner, and then linearly interpolate between them to get a cubic. Below is an image showing the control points used to create the three quadratic Bezier triangles.

Which results in a cubic Bezier triangle. The edges of this triangle are cubic Bezier curves.

This works the same way with any Bezier simplex by the way.

The Formula For Bezier Curves

Just like Bezier curves, you can come up with a multivariable polynomial to plug values into for Bezier Triangles, instead of using the De Casteljau algorithm. The De casteljau algorithm is more numerically stable, but the polynomial form is a lot more direct to calculate.

Let’s consider Bezier curves again first.

Lines (a 1-simplex) have two points, so we’ll describe control points as P_{xy}. If we want to make a degree N curve, our control points will be all the x,y pairs that add up to N. Here are the control points for the first few degrees of Bezier curves.

  1. Linear: P_{10}, P_{01}
  2. Quadratic: P_{20}, P_{11}, P_{02}
  3. Cubic: P_{30}, P_{21}, P_{12}, P_{03}

You might notice that a degree N curve has N+1 control points. Linear curves have 2 control points, quadratic curves have 3 control points, cubic curves have 4 control points and so on.

The index values on these control points tell you where they are on the line, and they also tell you the power of each barycentric coordinate they have, which we will get to in a minute.

The last part of making the formula is that the control point terms are also multiplied by the Nth row of Pascal’s triangle where row 0 is at the top row. (Pascal’s triangle image from

Doing that, you have everything you need to create the familiar Bernstein basis polynomial form of Bezier curves.

The image below puts it all together. Orange is the name of the control point, which you can see also describes where it is on a line. The index values of the control point also describe the power of the Barycentric coordinates s and t which are in green. Lastly, the row of pascal’s triangle is in blue. To get the formula, you multiply each of those three things for each control point, and sum them up.

The Formula For Bezier Triangles

You can follow the same steps as the above for making the formula for Bezier triangles, but you need to use Pascal’s pyramid (aka trinomial coefficients) instead of Pascal’s triangle for the control points. (image from

This time, we have 3 index numbers on control points instead of the 2 that curves had, but we still find all permutations which add up to N.

Here are the control points for the first few degrees of Bezier triangles.

  1. Linear: P_{100}, P_{010}, P_{001}
  2. Quadratic: P_{200}, P_{110}, P_{101}, P_{020}, P_{011}, P_{002}
  3. Cubic: P_{300}, P_{201}, P_{210}, P_{120}, P_{111}, P_{102}, P_{030}, P_{021}, P_{012}, P_{003}

Below are diagrams in the same style as the last section, which show the equation for a Bezier triangle and how to come up with it.

This pattern works for any Bezier simplex.

Use For Data Lookup On the GPU

If you wanted to use this for data lookup, you’d first have to fit your data with a Bezier triangle mesh, using whatever degree you wanted to use (quadratic or cubic are the seemingly likely choices!). The fit would involve not only finding the right values to give at control points, but also the right location to put the vertices of the triangles, so that you could make the mesh sparser where you could get away with it, and denser where you needed it.

Doing that fit sounds challenging, but I bet doing gradient descent of vertex positions and control points could be a decent place to start. Simulated annealing probably would help too.

Once you have your data fit, you need to figure out how you are storing it. Linear Bezier triangles only need data stored per vertex, but quadratic ones store data per edges too, and cubic ones store data per triangle. Beyond cubic, no special class of storage is needed.

Storing data per vertex and per triangle is easy, because that’s the paradigm we work with already in graphics (and raytracing specifically). The per edge data is harder. The reason that it’s harder is that to be fast, you want it to be a simple array, where you index it by the two neighboring triangles of the edge. A 16 bit index for triangle A and 16 bit index for triangle B means a 32 bit index into the edge data array. The problem is, that data is extremely sparse (each triangle index will connect to just 3 other triangle indices. The rest of the entries will be empty / useless!), so making a simple array is a lot of wasted space, and frankly is just a lot of space in general.

One idea is to store edge information in the per triangle data, even though it’s redundant (neighboring triangles would have duplicate info for the shared edge). It still would be a simple lookup.

Another idea is to store 3 edge indices in each triangle and use that to look up into an edge array for the per edge data. The memory usage here is minimized to only what is needed, but it’s a double dereference, so is pointer chasing on the GPU which doesn’t sound very nice.

Another idea would be to use minimal perfect hashing ( What that would allow you to do is get a hash function where you could put in the triangle index and which edge you were querying for (0, 1 or 2) it would give you as output the index into an edge array to get the data from. This is different from the last one because you aren’t reading an index for the edge, so is one less dependent data lookup.

Once you have your mesh fit, and you have a way to look up the data needed to evaluate a point on your Bezier mesh, assuming your mesh is oriented on the X,Z plane, with a Y value of 0 (the “height” data being stored in a different data channel), you can shoot a ray from (x,y,1) to (x,y,-1) for the (x,y) point you want to do a data lookup for. The ray will hit a triangle, give you barycentric coordinates, the triangle id, and enough info for you to read the control points for that triangle.

From there, you do your calculations and get your result. All done!

Is This a “1D” Bezier Triangle?

You might have noticed that with things as I’ve described them, control points can only be moved up and down on the vertical axis, but not on the horizontal plane.

That is true and this is a simpler form of a more general Bezier triangle.

If you want control points that you can move in any direction in 3d, all you need to do is make one Bezier triangle that you evaluate to get an X axis value, another to get a Y axis value, and another to get a X axis value.

Using it for data lookup seems to go out the window, but for other usage cases, this is probably useful knowledge.

How Did You Render The Surfaces?

TL;DR I use ray marching, but keep reading for more details.

How the 3d surfaces were rendered (the red/black checkerboard 3d images) is nearly out of scope of this post, but someone asked so I thought I’d share. Feel free to “view source” on that demo. It’s written in webgl so should be pretty easy to follow the shader etc.

First, i get a ray starting position and direction for each pixel of the image.

I next intersect that ray against a triangular prism shape (the green background you see behind the object). I do that by intersecting the ray by each plane that makes up that shape, and keeping track of the min and max hit time of the ray. If the min > the max, i know the ray missed the bounding shape and i can just draw black.

Next up, I calculate if the min time of the intersection is above or below the triangle bezier surface. I do this by getting the (X,Y,Z) position of the ray at that time, using (X,Z) to calculate Barycentric coordinates, using those to evaluate the triangle and see if the triangle’s height at that point is above or below the ray’s height (Y axis value).

Next, i take up to 20 steps from the start time to the end time, stopping when either the “aboveness” of the ray changes (meaning it intersected the surface) or if i get to the end of the 20 steps and nothing was hit.

If the ray intersected the surface, i check how much above the surface the ray was on the previous step vs where it is in the current step, treat it as a line (make a y=mx+b function out of these data points!) and solve for zero. I take that as my intersection point.

I next calculate the gradient of the surface using finite differences (, forward differences specifically, and use that to get a surface normal for shading.

Some improvements that could be done here…

  1. It’s possible to get the gradient of a Barycentric triangle analytically, and isn’t that difficult since it’s just a polynomial. Doing that instead of finite differences would make a more accurate and better gradient.
  2. Instead of taking 20 fixed sized steps, I could use the gradient to get a distance estimate to be able to take larger steps down the ray when it’s farther from the surface (more info here:
  3. If this needed to run faster, the step count could be dropped down to a smaller number. If you did that though, you’d start to notice “stairs” and other artifacts. A way to combat these visual problems would be to use a screen space blue noise texture as a “random number” from 0 to 1 to push each ray down that percentage of a step away from it’s starting point. That would turn the visual artifacts into a more pleasant blue noise pattern – the error would be evenly distributed and be much harder to notice. If you animated that noise well over time, it would look even better. If you did this under TAA, and possibly switched to interleaved gradient noise (something to try, not a slam dunk for sure always better thing IMO), it would look very nice. (more info on animating noise over time and also the follow up post)
  4. Another thing to try would be to not do evenly spaced sampling down the ray from min to max time. Uniform sampling is not great. For best results, you’d want to distribute the sampling locations in a 1d blue noise pattern so that it evenly distributed the error and had low aliasing to give you the most pleasant result. This would be hard to be compatible with some of the other options, but is a nice thing to try if not doing them.


There are lots of extensions that you could do to these Bezier triangles.

One extension would be to bring this to volumetric data with Bezier tetrahedrons. The result when using it for data meshes would be a tetrahedral mesh. To do a look up for a specific (x,y,z), you would shoot a ray from that point in any direction. The triangle you hit would contain 3 of the 4 points of the tetrahedron. You could shoot a ray in the opposite direction to get another triangle which had the 4th, or you could have data stored for that triangle index which says “if you hit this triangle from the front, the fourth vertex is M, if you hit this triangle from the back, the fourth vertex is N”, which means you just need to do a dot product of the ray and the triangle normal to figure out the fourth vertex, instead of doing another ray intersection against the BVH.

An easy but powerful extension would be that you could calculate the value of a point in two different Bezier triangles and divide the values. This would give you a rational surface which can represent a LOT more shapes than an integral surface can. For example, a quadratic Bezier curve can’t make conic sections (circles, etc), but a rational quadratic Bezier can make them, exactly.

Making it rational is just the R in NURBS, so other extension ideas are… NU = non uniform control points. R = rational instead of integral. BS = B-Spline.

Another extension that you could do is not limit the shape to being inside the simplex. For instance, here’s a cubic Bezier triangle where i let the equations go outside of the triangle, but am still limiting it to a cube. The math is totally fine with this. No division by zeros or any other badness.

Lastly, it turns out you can extend the concept of Barycentric coordinates to convex shapes that aren’t a simplex, such as a pentagon, or a dodecahedron. In the case of a quadratic Bezier pentagon, you would evaluate 5 linear pentagons – one for each corner of the pentagon – and then combine those with another linear Bezier pentagon interpolation. I’m not sure when this would be useful, but it’s definitely possible.

Check out this link for more info about generalized Barycentric coordinates:


There are a lot of links in this post, but I found this page to be really helpful for understanding some of the details of Bezier triangles.

Calculating Information Entropy

This post has an overview of information entropy then moves onto some technical details and experiments.

If you have any comments, questions, important additions, etc, please hit me up by commenting on this post, or on twitter at

The code that goes with this post is at

Thanks to all the people on twitter who answered questions & helped me learn this stuff. Links to threads and resources are at the bottom of the post!

Information Entropy Overview

Information entropy is a measure of how much information there is in some specific data. It isn’t the length of the data, but the actual amount of information it contains.

For example, one text file could contain “Apples are red.” and another text file could contain “Apples are red. Apples are red. Apples are red.”. The second text file is 3 times longer than the first but doesn’t really give any extra information. These two text files would have nearly the same amount of information entropy, despite them having different lengths.

It isn’t really possible to calculate the actual amount of information entropy a specific piece of data has in it because it’s related to Kolmogorov complexity: the length of the shortest program that can generate the data (, which turns out to be an incalculable value. That’s why i said that those two text files would have nearly the same amount of information entropy instead of saying they had the same amount. A program to generate the second text file is probably going to be very slightly more complex than the text file to generate the first one: it’ll have a for loop on the outside!

You can get a pretty good upper bound on information entropy though by making a histogram of how often each symbol appears, treating that histogram like a weighted N sided dice (a probability mass function or PMF), and calculating the entropy of that dice.

Interestingly, Huffman compression is related to this ( In Huffman compression, you take a histogram of the data, and give shorter length bit patterns to the symbols that occur more. This makes the overall data smaller, but doesn’t change the amount of information there is in the data. In fact, it’s part of a family of compressors called “Entropy encoders” (

Because of this, you can get a pretty good idea of how much information is in some data by how much you can compress it. If it shrinks a lot, the information density was low. If it doesn’t shrink much, or even gets larger (which happens when the compressed data plus the compression header is larger than the original data!), then the information density is pretty high.

However, there are counter examples to that. Encrypting data hides the patterns of the data, making it look more like uniformly distributed white noise (independent random numbers) while keeping the data the same size. Encrypting data doesn’t actually change the amount of entropy in the data, but it changes the APPARENT amount of entropy. It makes the “shortest program that can generate the data” harder to find, and in fact, encrypted data will not compress!

This actually feels opposite to some things i’ve read about machine learning type algorithms. In machine learning, finding the solution to a problem involves finding the global minimum, and you want a global minimum to be wide and flat so that it’s easy to find and hard to fall out of. There are techniques to make this be more true, such as the SVM kernel trick, or neural networks which make the problem space be higher dimensional and be saddle shaped to help not get stuck in local minima. Encryption seems to do the opposite, making the minimums (global OR local!) of the Kolmogorov complexity be very hard to find, like a dirac delta, where there is no hint where they are so that you have to know where they are in advance (aka you have to know the encryption key!) to be able to find the minimum.

Before moving onto details and experiments, here is a really cool read “Entropy and Redundancy in English” which talks about how Claude Shannon calculated letters in the english language to have about 2.62 bits of information per letter.

Interestingly, that last link also talks about something I wrote up previously, which is using markov chains to generate text.

There is also a really interesting tweet from Donald Mitchell (who is famous for research which includes Mitchell’s best candidate algorithm for generating blue noise sample points!), which talks about encryption and compression and generating valid looking language. It’s a very cool tangle of ideas I’m still sorting out in my head 🙂

That “transforms English sentences to other English sentences” part also sounds a lot like format preserving encryption.

The Formula

To calculate information entropy, you need to calculate the entropy for each possible event or symbol and then sum them all up.

To calculate the entropy of a specific event X with probability P(X) you calculate this: -P(X)*log_2(P(X))

As an example, let’s calculate the entropy of a fair coin.

The probability of heads is 50%. Here’s the entropy we get when plugging that 0.5 into the equation:
E_{\text{heads}} = -0.5*log_2(0.5) \\ E_{\text{heads}} = -0.5*-1 \\ E_{\text{heads}} = 0.5
Since tails has the same probability it has the same entropy as heads (which is 0.5) so we add the entropy of heads and tails to get 1 bit of entropy for a fair coin flip. You can alternately say it’s 1 Shannon of entropy, and if you use different log bases, there are other unit of measurements. The important take away though is that a fair coin gives 1 bit of entropy.

What if the coin isn’t fair though? Here’s the entropy if there’s a 75% chance of heads and a 25% chance of tails.
E_{\text{heads}} = -0.75 * log_2(0.75) \\ E_{\text{heads}} = 0.311 \\ E_{\text{tails}} = -0.25 * log_2(0.25) \\ E_{\text{tails}} = 0.5 \\ E = 0.311 + 0.5 = 0.811
So, in this case, each flip of this unfair coin only gives you 0.811 bits of information. It kind of makes sense because each flip you are pretty sure it’s going to be heads, so when it is heads, it doesn’t give you a whole lot of information.

What if the coin somehow is incapable of landing tails. Maybe it is just a 2 headed coin.
E_{\text{heads}} = -1 * log_2(1) \\ E_{\text{heads}} = 0 \\ E_{\text{tails}} = 0 * log_2(0) \\ E_{\text{tails}} = 0 \\ E = 0 + 0 = 0
Flipping this maximally unfair coin gives us zero bits of information which makes sense. We know that it will always land heads so the coin flip gives us absolutely no new information.

This equation works the same way no matter how many events or what their probabilities are. It works with 6 sided dice. It works with byte values in a text file, it also works with bits in a text file (going back to heads or tails).

When you have N discrete symbols or events, and a probability of encountering each, it’s called a probability mass function or PMF, which again, can be made from a histogram.

Interestingly, entropy calculations have a full suite of calculations similar to probability calculations. There is joint entropy, conditional entropy, the chain rule of entropy and even a Bayes rule of entropy!

Text Experiments

The C++ code and source data for these experiments can be found on github at

We are going to start with a 26KB text file of “The Last Question” by Isaac Asimov. (from

We can make a histogram of how many times each 8 bit letter occurred, then turn that into a PMF by calculating the percentage occurrence of each character. From there we can calculate the entropy and divide by 8 to get the entropy per bit. Doing that we get the value 0.582. So, there are 0.582 bits of information entropy per bit of data in that text file. (Technically: or less)

If we compress it with the standard zip file compressor in windows, making an 11KB zip file, then do the same to that file, we get a value of 0.962 bits of information per bit of data in that text file. The file shrank to 42.3% of the size (the old file is 2.36 times larger) and we got 1.65 times as much information entropy.

If instead of compressing the text file, we encrypt it with openssl (command: openssl enc -aes-256-cbc -salt -in lastquestion.txt -out lastquestion.enc -pass pass:moreentropyplease), then calculate the entropy of that file, we get a value of 0.926 bits of information entropy per bit of data.

It’s weird to me that the encrypted data reports a lower entropy than the compressed data, and i’d expect the opposite. The issue might be that our calculation of entropy is naive, and that a more robust entropy calculation would show the encrypted data to have more entropy.

An example of being less naive would like making a histogram of 4 bit, 6 bit, 12 bit values etc instead of only 8. Another thing to do would be to try calculating entropy of re-arrangements of the data. Ultimately, you would report the lowest entropy seen from different “views” of the data.

I’m not sure if that’s what’s at play in the entropy between encrypted and compressed data though.

Interestingly, if you try to compress the encrypted data, it doesn’t really compress and stays at 26KB. The encryption has hidden the structure of the source data pretty well! It now reports an information entropy per bit of 0.923 which is very slightly down from 0.926 bits. It might be the added useless compression header causing the information entropy to go down.

What if we take the zip file of the plain text file and base 64 encode it? We are taking the same data and encoding it in fewer 8 bit symbols. That makes the file larger without changing the amount of information in the data. It’s like the reverse of compression so we’d expect the amount of information entropy per bit to go down and that is what happens! The file increases from 11KB to 14KB and we get an information entropy per bit value of 0.749, which is down from 0.962.

Noise Distributions

What if we look at the entropy of uniform white noise (plain vanilla random numbers)? Taking 100,000 white noise bytes, building a histogram of each 8 bit value, making a PMF and calculating entropy, we get 0.999733. That makes sense because white noise gives maximal entropy so we’d expect a value of 1 and got something very close.

In a recent post i talked about using dice to generate different colors and distributions of noises ( Do different distributions of noise have different amounts of entropy?

First, let’s look at a triangle distribution vs a uniform distribution.

Rolling a 6 sided die gives us a uniform distribution with possible values 1 through 6. The entropy of a 6 sided die is 2.585 bits.

If we wanted a triangle distribution with possible values 1 through 6, we could roll a 4 sided die and a 3 sided die (strangely, they exist! and add them to get values 2 through 7, then we could subtract 1 to get values 1 through 6.

We can make a histogram by looking at how many ways each possible number can come up, and turn them into percentages:

  • one= 1 = 8%
  • two = 2 = 17%
  • three = 3 = 25%
  • four = 3 = 25%
  • five = 2 = 17%
  • six = 1 = 8%

Summing the entropy of each of those percentages gives about 2.454 vs a uniform 6 sided die giving 2.585 bits. So, triangular distributed values have less entropy than uniform distributed values. That makes a bit of sense because we already knew that a fair coin gave more entropy than an unfair coin, and a triangle distribution is like an extension of an unfair coin to dice.

Noise Colors

Next, let’s look at the color of noise (frequency composition / correlation). Let’s use blue noise because that’s my favorite.

I made some floating point 1d blue noise data (with values between 0 and 1) using Mitchell’s best candidate algorithm (, and converted the floating point values to uint8.

If you look at a histogram of those uint8 values, it will show a flat histogram, meaning a uniform distribution and will give maximal entropy.

That isn’t quite right though, because we know that the values have randomized high frequency components and no low frequency components.

That means that each roll is based on the previous rolls, and so the rolls aren’t uncorrelated, and they aren’t independent. That means they should have less entropy.

One idea to capture this relationship is to take each pair of uint8s in the data, make them into a uint16, and make a histogram of those uint16s (Thanks Nathan! If the uint8s were uncorrelated and evenly distributed, then the uint16s should be evenly distributed. That would mean that a value didn’t care about what it’s previous value was.

Since we know there is correlation in the blue noise, the result in our case shouldn’t be evenly distributed, and should result in less entropy than white noise.

That’s the result we get in fact, and so for 100,000 uint8 blue noise values, we get an entropy of 0.92608, where white noise gives 0.965912.

So interestingly, blue noise is more predictable and less random than white noise, which shouldn’t come as a huge surprise, since we know the PMF isn’t a uniform distribution at each new value.

Making a histogram of 16 bit values of the 8 bit stream like the above is equivalent to making an order 1 Markov chain and only has a “memory” of the previous value. You could make higher order Markov chains by making 24 or 32 bit values and beyond, to find correlation that went beyond immediate neighbors, but keeping a histogram of 2^24 or 2^32 values and larger is a pretty large amount of memory. You’d also need a lot more source data, to give the histogram a chance to stabilize / converge.

There might also be a delay of correlation, like each value could be correlated with the value that is 100 values ago. At one byte per value, you’d need an order 100 Markov chain to capture that relationship, and so would need a 2^800 bytes of memory which is astronomically large.

Other Info

If you look at the csv in the output folder of the repository for this post (, you’ll see some oddities.

One is how when looking at 11 bits at a time, the plain text has more entropy than when looking at it with 8 or 12 bits. The reason for that specific thing is that the data inherently is 8 bit, which is coprime to the 11 bits we are looking at it with, so makes the entropy look a lot higher by making it seem like there are more unique bit patterns than there are. Ultimately, when calculating entropy in a variety of ways, you should take the minimum value you find as the final answer, because it’s related to the shortest program you can reproduce it with. There are plenty of longer programs, but you only care about the shortest. You should also be careful not to get too crazy in whatever code you use to make your “view” of the data (re-arranging values / looking forward & backward a far amount) because that stuff is part of the program needed to reproduce the data, but isn’t represented in the entropy calculation you are doing.

Something else is that if you don’t have enough data, the histogram / PMF won’t converge to the actual values. You can see this manifesting in the csv data in the difference between the “small white noise” which is 8 bytes of white noise, and the “white noise” which is 100,000 bytes of white noise. They really do have the same characteristics and entropy amount per bit, but the small white noise doesn’t have enough data available to show the correct values.


A very related, but more formalized, write up on information theory.

A stack exchange question about how to calculate entropy of correlated samples.

Here’s a neat post that shows how to visualize entropy in a binary file, to find where encryption keys are since they are high entropy. Thanks Ashley! (

The “Breaking Math” podcast episode 24 is really good and related to this stuff:

“The Complexity Barrier” blog post by John Baez

A thread showing how Oskar Stålberg is bitbacking data very efficiently, so the data is not compressing very well at all.

Some related twitter megathreads:

The difference between uncorrelated and independent: