Multiple Importance Sampling in 1D

This is a follow up to an article I wrote a few years ago on Monte Carlo integration and importance sampling in 1D:

The simple, well commented code that generated all the data for this post can be found at:

A challenge when doing Monte Carlo integration in rendering is that the function you are trying to integrate is often made up of other functions multiplied together. While you may know how to importance sample some of the parts individually, you ultimately have to choose which thing to importance sample, because you are generating random numbers according to whichever thing you choose.

In rendering, the three things usually being multiplied together are lighting, material and visibility (which makes shadows). Lighting and materials are things you can usually importance sample and are based on the type of light (like a spherical area light) and the material model (Like a PBR microfacet BRDF), while visibility is not usually able to be importance sampled because it is entirely due to the geometry in a scene as to whether a pixel can see a light or not.

If you importance sample based on lighting, you can get poor results when the material ended up being more important to the result. Likewise, if you importance sample based on material, you can get poor results when the lighting ended up being more important to the result.

Multiple importance sampling is a way to make it so that you don’t have to choose, and you can get the benefits of both. More generally, it lets you combine N different importance sampling techniques.


Before going into the explanation, here is how you actually get 1 MIS sample using the balance heuristic, when you have two importance sampling techniques:

F is the function being integrated. PDF1 / InverseCDF1 are for the first importance sampling technique. PDF2 / InverseCDF2 are for the second importance sampling technique. You do this in a loop N times, and take the average of those N estimates, to get your final estimate.

You can generalize to more techniques by just following the pattern. Each sampling technique generates it’s own x and y. Each sampling technique calculates the pdf for that x value for each of the other pdfs. The estimate is the sum of: each y value divided by the sum of each pdf for the corresponding x value.

Note that if part of the function F is expensive (like raytracing for visibility!) you don’t have to do that for each sample. You could get your estimate of lighting multiplied by material like in the above, and after combining them, you could then do your raytracing to multiply in the visibility term.

MIS Explained

You can get a single sample from a monte carlo estimator by randomly generating an x value and calculating the estimate as the function value at that x, divided by the PDF value of choosing that x.

\text{Estimate} = \frac{f(x)}{\text{PDF}(x)}

You may also remember that as the shape of the pdf (histogram) of the random numbers gets closer to the shape of the function you are trying to integrate, that you can get a closer estimate to the actual answer with fewer samples. This is called importance sampling.

Let’s say though that you want to integrate the function f multiplied by the function g and you are able to generate random numbers in the shape of f, and random numbers in the shape of g, but not random numbers in the shape of f multiplied by g.

You know that you can choose to importance sample based on f or g, but that the choice is better or worse situationally. Sometimes you want f, other times you want g.

The simplest way to combine these would be to just use them both for each sample and average them. You could also switch off so that even numbered samples importance sampled by f and odd numbered samples importance sampled by g. This is the same as giving each technique a weighting of 0.5.

We can do better though!

We can make an x value to importance sample based on f, and another x value to importance sample based on g, and then we can calculate the PDF values of each x for each PDF.

If we have good importance sampling PDFs, higher PDF values mean higher quality samples, while lower PDF values mean lower quality samples. We now have the means to give a weighting to a sample based on it’s quality as shown below, where we calculate the weight for sample “A”. Sample “B” would do the same.

\text{Weight}_A = \frac{\text{PDF}_A(x_A)}{\text{PDF}_A(x_A)+\text{PDF}_B(x_A)}

This is called the “balance heuristic”. There are other heuristics that you can use instead, which you can read about in Veach’s thesis (in the links section) and other MIS papers which have come out since then.

If we have a Monte Carlo estimate sample like this:


Some interesting cancelation happens if we multiply that by the weight.

\frac{f(x_A)}{\text{PDF}_A(x_A)} * \frac{\text{PDF}_A(x_A)}{\text{PDF}_A(x_A)+\text{PDF}_B(x_A)} = \frac{f(x_A)}{\text{PDF}_A(x_A)+\text{PDF}_B(x_A)}

That form is the same form seen in the code from the last section, where we also had a sample B that we added to it to get the final estimate.

You may be wondering why sample A and sample B are added together… shouldn’t they be averaged?

Well, if you look at the denominator in that last formula, two PDFs are added together. Each PDF has an expected value of 1, so the expected value of that sum in the denominator is going to be 2. That means that the estimate is going to be half as big as it should be. When you add two of them together, they are going to be as large as they should be. All that has happened is that instead of adding them together and dividing by two to average them, we have divided them by two implicitly in advance before adding them. We are still averaging the two samples. It isn’t exactly averaging, since the PDFs will vary from sample to sample, but on the whole, it’s still an unbiased combination of the two PDFs, which is why we still get the correct answer.

If three PDFs were involved, the weighted samples would be one third the size they should be, and there would be three to add together.

If four PDFs were involved, the weighted samples would be one fourth the size they should be, and there would be four to add together.

It generalizes to any number of importance sampling techniques involved.

One Sample MIS

If you are a fan of stochastic rendering like me, you may be wondering if you really have to do both (all) of the samples, or if you can use the weighting to choose one stochastically and end up with the correct result for less work.

Yes, you can indeed do this and in Veach’s thesis he calls this the “One-Sample Model” in section 9.2.4.

In this case, what you do is calculate the weight for each sample, and then divide each of those weights by the sum of the weights to get a probability for taking that specific sample.

After you roll a random number and choose the single sample to contribute to the estimate, you need to multiply the Monte Carlo estimate by the chance of choosing that item. You are left with something that uses multiple PDFs for importance sampling different parts of the function, but each sample evaluates the function F only once. Useful if F is costly to evaluate.

If you expand out weight1, weight2 and weight1chance, you’ll find that some things cancel out and you are left with the below for actually calculating the estimate. I have to admit I don’t have a good intuitive explanation for why that works, but the algebra says it does, and it checks out experimentally. If you have an explanation, leave a comment!

Piecewise Importance Sampling

Multiple importance sampling is a method for combining any number of importance sampling techniques to sample a specific function.

Something interesting though is that not every PDF involved has to cover the entire function.

What i mean is that you could have a PDF which sampled only from the left half of the domain of a function, and another PDF which sampled only from the right half of the domain of a function.

What would happen is that the inverse CDF for the first technique would only generate x values on the left half of the function to integrate, and the PDF would give zero for any value on the right have of the function.

The second technique would do the opposite.

MIS would not care about this in the least. It would function as normal and let you importance sample a function piecewise, if you could make PDFs that fit the parts of a function well, but weren’t able to make a PDF that fit the entire function well.

Veach’s thesis goes into other things as well, such as being able to give different sample counts to different techniques. It’s definitely worth a read!

Experiment #1 – Importance Sampling & Warm Up

Quick reminder, the code that made the data for these experiments is at:

First up we are going to integrate the function y=\sin(x)*\sin(x) from 0 to pi, doing 10,000 different tests, each test doing 5000 samples, and average the results. We are going to use regular Monte Carlo (mc) as well as importance sampled Monte Carlo (ismc), using the PDF y=\sin(x)/2. Below is the function we want to integrate, and the PDF that we are going to use to importance sample it.

We could show the absolute value of the error at each step (the error being averaged over all those tests) and get this. (data from out1.abse.csv)

That isn’t super easy to read other than seeing importance sampling seems to be less erratic and lower error more reliably. We can change it to be on a log/log plot which helps see decay rates better (especially when things like low discrepancy sequences are involved, which we’ll see later).

That’s an improvement, but there is a lot of noise, even after 10,000 tests. Monte Carlo is noisy by definition, so as you can see, sometimes it gets really low error, but then pops right back up in the next few samples. That erratic nature is not good and if you are doing integration per pixel, the variance will make the noise especially bad. In fact, variance is what we really care about. So long as the integration is converging to the right thing (is unbiased / has zero bias), variance will tell us how quickly it is converging on the right answer.

Here is a log/log variance graph. You can more easily see that the importance sampling is a clear win over the non importance sampled Monte Carlo Integration. (data from out1.var.csv)

Now that we see that yes, importance sampling is helpful, and we have our testing conventions worked out, let’s continue on to more interesting topics!

Experiment #2 – Multiple Importance Sampling

Next up, we are going to integrate the function y=\sin(x)*2x from 0 to pi. We are going to use regular Monte Carlo, but also importance sample using y=\sin(x)/2 again, and also y=x*\frac{2}{\pi^2}. We are also going to do multiple importance sampling using both of those PDFs in conjunction, and also do the “single sample method” of MIS. Here are the functions mentioned.

Here is the log/log variance graph.

Monte Carlo (mc, blue) is the obvious worst. Multiple importance sampling (mismc, green) is the obvious best. The second place worst is importance sampling by the line function (ismc2, yellow). The second place best is importance sampling by the sin based PDF (ismc1, red). The one sample method (mismcstoc, purple) seems to be basically the same as the red line. It takes half as many samples as mismc, so it isn’t surprising that it does worse.

It is good to see that multiple importance sampling is worth while though and does significantly better than the two importance sampling methods involved do by themselves.

Experiment #3 – Piecewise Importance Sampling

Next we are going to do piecewise MIS. We are going to integrate y=\sin(3*x)*\sin(3*x)*\sin(x)*\sin(x) using three PDFs for importance sampling where each is just y=\sin(x)/2 shrunken on the x axis to be 1/3 the size and shifted over so that each PDF is responsible for one third of the function domain. The first PDF for example is y=\sin(3*x)*\frac{3}{2} from 0 to pi/3.

Here is the function we are integrating, showing the 3 zones the PDFs cover:

Here is the first of the PDFs. The other two look the same but are shifted over on the x axis.

Here is the variance for regular Monte Carlo versus the piecewise importance sampling, showing that it is a significant improvement to do the piecewise IS here.

Experiment #4 – Low Discrepancy Sequences

Unsurprisingly it turns out that low discrepancy sequences are useful when doing multiple importance sampling. It would be fun to look at using LDS in MIS / IS deeper in a future blog post, especially because things change in higher dimensions, but here are some interesting results in the mean time.

Here is the first experiment, which compared Monte Carlo (mc, blue) to importance sampling (ismc, yellow), now also using low discrepancy sequences for both.

For low discrepancy Monte Carlo (mclds, orange), instead of using white noise independent random numbers 0 to 1 to make my x values, I start the x value at a random number in 0 to 1 for the first sample x value, but then I add the golden ratio to it and use modulus to keep it between 0 and 1 for each subsequent sample. This is the “Golden Ratio Additive Recurrence Low Discrepancy Sequence”. That beats both Monte Carlo, and importance sampled Monte Carlo by a significant amount.

For low discrepancy importance sampled Monte Carlo (ismclds, green), I did the same, but put that sequence through the inverse CDF to generate numbers from that PDF, using LDS as input. It’s worked well here in 1D, but mixing LDS and IS can be problematic in higher dimensions due to the LDS being distorted from the importance sampling warping, and then losing it’s low discrepancy properties.

Here is the second experiment, which compared MC to IS to MIS, now including low discrepancy sequences:

Everything improved by using LDS, but interestingly, the order of best to worst changed.

Not using LDS, multiple importance sampling was the winner. Using LDS, MIS is still the winner. Since there are two streams of random numbers needed for the MIS (one for each importance sampling technique), I used a different low discrepancy sequence for each. For the first technique, i used the golden ratio sequence. For the second technique, I did the same setup, but used the square root of two instead of the golden ratio. The golden ratio is the best choice for this kind of thing, because it is the most irrational number, but square root of two is a pretty good second choice.

Not using LDS, Monte Carlo was the worst performing, but using LDS, Monte Carlo is in the middle, and it’s the first importance sampling technique that does the worst. The second importance sampling technique is in the middle whether you use LDS or not though.

Here is the third experiment now with LDS, which compared Monte Carlo to a piecewise importance sampled function.

This MIS here needs 3 streams of random numbers, so for the LDS, I used the golden ratio sequence, the square root of 2 sequence, and a square root of 5 sequence. Once again, LDS helps convergence quite a bit in both cases!

I’m starting to run out of “known good irrational numbers” so I’m glad we are at the end of the LDS experiments. There are other type of low discrepancy sequences that don’t use irrational numbers, but then you start having to consider the LDS quality along with the results and all the permutations. If you want to go into a deep dive about irrational numbers, give this article of mine a read:

Before moving on, look at that last graph again. The amount of variance that 5,000 white noise samples has is the same variance that piecewise importance sampling had, when using only 10 low discrepancy samples. Without LDS though, even the MIS strategy took something like 800 samples to reach that level of variance.

In graphics, these samples could easily represent rays shot into the world for something like global illumination, soft shadows, or raytraced reflections.

It would be real easy to try the most naive Monte Carlo algorithm, find out that you need 5000 samples to converge and give up.

Facing this, you may bust out the MIS and try to do better, finding that you could cut the cost to about 1/6 of what it was, at 800 samples needed to converge. That’s still a ton of samples for real time rendering, so is still out of budget. It would be real easy to give up at this point as well.

If you take it one step further and figure out how to get a nice LDS into the MIS instead of white noise random numbers, you could find that you can decrease it even further, down to 1/80th of what MIS gave you, or 1/500th of the cost of the naive Monte Carlo.

10 samples is still quite a few if we are talking about per pixel raytracing, but that is in the realm of real time affordable.

Good sampling matters, and can help you do some pretty amazing things.

Experiment #5 – Blue Noise

Where low discrepancy sequences are deterministic number sequences that give you good coverage over a sampling domain, blue noise is randomized (non deterministic) number sequences that do the same.

There is some nuance to LDS vs blue noise, and when one or the other should be used. The summary is that regular blue noise converges at the same rate as white noise (there are variants like projective blue noise which do better at convergence) but that it starts with a lower error. Blue noise also has better noise perceptually, which is also more easily filtered (it is high frequency noise only, instead of full spectrum noise). So, the rule in graphics is basically that if you can converge with LDS, do that, else use blue noise to hide the error. Blue noise also does better at keeping it’s desirable properties when put through transformation functions, such as importance sampling.

Unfortunately, blue noise is pretty expensive to calculate, especially with the algorithm I’m using for it, so the sample and testing counts are going to be decreased for these tests to 100 tests, using 500 samples each. Blue noise is best for low sample counts anyways, so decreasing the sample count makes for a more appropriate comparison.

Here is the first experiment, which compared MC to ISMC. Now it has blue noise results, to go along with the LDS results.

The result shows that blue noise does better than white noise, but not as good as LDS.

Here is the second experiment, comparing MC to MIS, now with blue noise. You can see how again the blue noise quality is between white and LDS as far as variance is concerned.

Here is the third experiment, showing the effectiveness of the piecewise importance sampling, using MIS. Once again, blue noise has variance between white noise and LDS.


Here are some other great links for learning about MIS via different points of views and different explanations.

Veach’s thesis that introduced MIS and goes into quite a few other options for MIS, as well as more rigorous proofs on variance bounds and similar

Thanks for reading!

Frequency Domain Image Compression and Filtering

Over 4 years ago I wrote a short blog post on images in the frequency domain:

It’s time to revisit the topic a bit and add some more things.

If you are curious about how the Fourier transform works, which can transform images or other data into the frequency domain, give this a read:

The C++ code that goes with this blog post can be found at

Image Compression

When you transform an image into the frequency domain, you get a complex number (with a real and imaginary component) per pixel that you can use to get information about the frequencies (literal sine and cosine waves) that go into making the image. One piece of information is the “phase” or starting angle of that wave. You get the phase by using atan2(imaginary, real). The other piece of information is the “amplitude” of that wave, or how large the wave is in the image. The amplitude is the length of the 2d vector (real, imaginary).

A quick and easy way to do image compression then, is to convert an image to frequency space, find the lowest amplitude frequencies and throw them away – literally zero out the complex number. If you throw enough of them away, it’ll take less data to describe the frequency content of an image, than the pixels of the image, and you’ll have compressed the image.

The more aggressive you are at throwing away frequencies though, the more the image quality will degrade. This is “lossy” compression and is a simplified version of how jpg image compression works. Lossy compression is in contrast to lossless compression like you find in png files, which use something more like a .zip compression algorithm to perfectly encode all the source data.

In the code that goes with this post, the DoTestZeroing() function throws out the lowest 10% amplitude frequencies, then the lowest 20%, then 30% and so on up to 90%. At each stage, it writes all complex frequency values out into a binary file, which can then be compressed using .zip as a method for realizing the image compression. As the data gets more zeros, it gets more compressible.

The top row in the image below shows an original 512×1024 image, the DFT amplitude information, and the DFT phase information. The bottom row shows the same, but for an image which has had it’s lower 90% amplitude frequencies thrown away. The DFT data is 8MB for both (uncompressed), and compresses to 7.7MB for the top picture, but only 847KB for the bottom picture. The inverse DFT was used to turn the modified frequency data on the bottom back into an image.

Here is another image which is 512×512 and has DFT which is 4MB uncompressed. The top image’s DFT data compresses to 3.83MB, while the bottom compresses to 438KB.

While fairly effective, this is also a pretty naive way of doing frequency based image compression!

More sophisticated methods use the “Discrete Cosine Transform” or DCT instead of the DFT because it tends to make more of the frequency magnitudes zero, consolidating the data into fewer important frequencies, which means it’s already smaller before you start throwing away frequencies. DCT and DFT also pretend that the images go on forever, instead of just stopping at the edge. DFT acts as if those images repeat in a tiled fashion, while DCT acts as if they are mirrored at each repeat, which can also be a nice property for image quality.

Other methods break an image up into blocks before doing frequency based compression. Also, you can use wavelets to compress images, or principle component analysis or singular value decomposition. You can also fit your image with “whatever” basis functions you want, using L1 norm regularization to promote the coefficients of your fitting to be zero, to make the fit data be less sparse, just like DCT does compared to DFT.

Another thing you can do is use compressed sensing to skip a couple steps: You take a couple randomized but roughly evenly spaced samples from the image (blue noise or LDS are going to be good options here), and then you can eg find Fourier basis coefficients (DFT!) that match the sparse/irregular data samples you took. This is like throwing out low frequencies, but without having to DFT the whole data set, and then throw things out. It starts with sparse data and then fits it.

Bart Wronski has several write ups on his blog in this area, so give them a read if you are interested:

This is a great read showing how to fit data using L1 regularization and all the related information you might be interested in:

This video is a great overview of the random grab bag of other things I mentioned:

Image Filtering

In my previous post on this topic I showed how you could throw away frequencies that were farther than a certain distance from the center to low pass filter an image, aka blur it. I also showed how if you threw away frequencies closer than a certain distance, it would high pass filter an image, aka sharpen it.

That throwing away of frequency data based on distance is the same as multiplying the frequency data by a mask which has a 1.0 in some places and a 0.0 in others. You can generalize this to multiply frequencies by any number. In the below I restrict the multiplications to be between 0 and 1, but you could definitely go to larger numbers or even go to negative numbers if you wanted.

The below shows the patterns that the images are multiplied by in this section. Top row left to right is a low pass filter, then a stronger low pass filter (gets rid of more high frequencies than the other) and lastly is a notch filter or “band stop” filter. The bottom row is the complement, such that you get the bottom by subtracting the image from white (1.0). Left to right, the bottom row is a high pass filter, then a weaker high pass filter (lets more low frequencies in) and then a band pass filter which only lets certain frequencies through.

First up is the “Loki and Alan” picture. Frequencies and actual picture values filtered from the pictures on the top are present in the pictures on the bottom and vice versa. In this way, blurring compared to sharpening (and edge detection) are two sides of the same coin. It just matters which part you throw away and which part you keep.

Here is what the frequency magnitudes look like. Note that each image has the magnitudes put through a log function, and also normalized to be 1.0 max. This is why even though the high pass filters (and band pass) darken the middle, it doesn’t seem like it. The renormalization obscures that fact a bit, and the middle is brightest (largest amplitudes) which we saw when throwing out the lowest amplitudes in the last section.

Here are the same filters applied to the scenery image. The top right image has some strange patterns in it if you look closely (click the image to view the full size in another tab).

Image Convolution

In the last section, we made “images” by using a distance function, to make values to multiply the frequencies by to filter out certain frequencies.

In this section, we are going to take two images, put them into frequency space, multiply them together, take them out of frequency space, and see what kind of results come out.

There is something called the “convolution theorem” which tells us that multiplication in the frequency domain, is the same as convolution between the images. Convolution is an expensive operation, because you have to loop through all the pixels of one image, and at each pixel, loop through the pixels of the other image, and do some multiplications and additions. Convolution is so slow, that it can actually be quicker to take two images you want convolved to frequency domain, multiply them together, and then take them out of frequency space to be images again.

Convolution is used in graphics for things like blurs, sharpening, or applying bokeh for depth of field, so speeding it up can be a big help! Convolution is also used in audio for things like reverberation which makes audio sound like it was played inside of a cave or a big cathedral.

Technical note: the “kernel” image needs to be centered at pixel (0,0), not the center of the image. Also, the kernel image should be normalized so that summing up all of it’s pixels adds up to 1.0. You also need to zero pad (add a black pixel border to) both the source image and kernel image to be the size of source+kernel+1 on the x and y axis before DFT’ing so they are the same size, and to avoid wrapping problems. After you are done multiplying and inverse DFT’ing, you can remove the black border again.

Here are the 4 images we are going to use as kernel images: A star, a plus, a circle, and a blob.

Here are the DFT magnitudes of those images.

Here is the “Loki and Alan” picture convolved with those kernel images.

You can see that the images somehow take on the qualities of the kernel… the star one is very angular, the plus one is very “plus like” and the circular one is very circular. Note how the blob acts a lot like a low pass filter! In frequency space, it does actually look like one, so that makes sense:

Here is the scenery picture convolved by the same shapes.

If you think the above looks weird when doing convolution on images, you should give a listen to convolution being used in audio. When used for reverb it sounds good, and sounds correct, but if you use it to convolve arbitrary audio samples together, you can get some really interesting and bizarre sounds! You can hear that here:

The dark border around the image is an artifact from adding a black border around the images to make them the right size (zero padding). If you instead just make the convolution kernel image as large as the image you are convolving (and that is already a power of 2, since this FFT requires that), you’d get the below, which has part of the image “wrapping” across from the other side.

If you used the DCT (discrete cosine transform) instead, it would MIRROR the texture instead of wrapping it, so you’d get more similar pixels to what should be there most of the time, compared to DFT which wraps. Another way to solve this problem though is if you are doing convolution in image space, instead of frequency space, is you can throw away any samples that go outside of the valid area of the images. You want to sum up the weight of the samples you actually took though in this case, and divide the final convolution sum by that weight, to normalize it. That will make pixels near the border have higher weights than they should, but it can be a less jarring artifact than the black border, wrapping, or mirroring artifacts.

Truth be told, many of the operations in this article can be done in a handful of lines of python. I find a lot of value in implementing things myself though, as it helps me internalize the ideas to better understand when and how to use them, and how to avoid problems/mysteries that come up when things are used as black boxes. I feel the tide turning though after a recent look at the sea of algorithms relating to SVD,PCA and finding eigenvectors. That is some crazy stuff, and way too much for a single person to deal with, while still trying to be competent in other topics 😛

When Life Gives You Lemons, Make Random Numbers

I saw this tweet go by on twitter and wondered – who of us hasn’t been in this situation, with things such as…

  • Novella comments on a blog post
  • stack overflow answers
  • “reviewer #3” response on a research paper
  • Yet another javascript framework
  • Or if it’s just review time and your boss needs to come up with some text to justify review scores that were pre-ordained by corporate politics, budgets constraints and the popularity contest farce they try to pass off as meritocracy.

What all these things have in common is that after the fact, you find yourself in possession of completely useless text.

That is, completely useless until today.

John von Neumann gave us the insight we need. If you have a biased source of bits, look at them in pairs. If they don’t match, you take the value of the first one as the unbiased bit to output. We can do this with whatever garbage we find has dropped into our lap, and turn that steaming heap of dung into something actually useful.

I implemented this and you can find the source code on github at

More info on the technique:

Here is some example output, using words from the great orator MacDonald Trump as input.

Remember… When life gives you lemons, make random numbers.

Small print: Yeah this is just a joke and the numbers coming out are only as random as the numbers going in. This technique is a way to turn biased uniform random numbers into unbiased uniform random numbers and works with “random” things like coin flips and dice rolls, but less so with things like text which will make patterns in output. You’d want to use Decorrelation to turn this from a joke into a real thing 🙂

Irrational Numbers

An irrational number is a number that can’t be represented as a fraction using integers for the numerator and denominator.

I’m a big fan of irrational numbers, and one of the biggest reasons for that is that they are great at making low discrepancy sequences, which give amazing results when used in stochastic (randomized) algorithms, compared to regular random numbers. Low discrepancy sequences are cousins to blue noise because they both aim to keep samples well spread out, but the usage case is different, so it’s situational whether to use blue noise or an LDS. That means that luckily there is room enough in the world for both LDS and blue noise.

This post is a random grab bag of things relating to irrational numbers.

Pythagorian cultists supposedly murdered someone to keep irrational numbers a secret, so this is technically forbidden knowledge, I guess.

Making Continued Fractions – Pi and Milü

Continued fractions are a way of writing numbers that can be useful for helping analyze irrational numbers, and rational approximations to specific numbers – whether they are rational or irrational.

If a continued fraction is infinitely long, that means it represents an irrational number, and all irrational numbers have continued fractions that are infinitely long. If a continued fraction is not infinitely long, that means it’s a rational number.

Here is the beginning of the continued fraction for pi.

Another way of writing continued fractions is to get rid of all the redundant “1 divided by…” and just write the integers you see on the left. The continued fraction for pi above would look like this, which is a lot more compact:


Let’s talk about how you make a continued fraction by walking through how to make it for pi, or at least 3.14159265359 anyways, since pi goes forever.

First up you take the integer part and use that as the first digit. Subtracting that integer part out leaves you with a remainder:

remainder: 0.14159265359

We take 1/remainder to get 7.06251330592. The integer part of this number is going to be our next number. Subtracting the integer part out gives us our next remainder:

remainder: 0.06251330592

1/remainder is 15.9965944095, which makes our next integer and remainder into:

remainder: 0.9965944095

1/remainder is 1.00341722818, so now we are at…

remainder: 0.00341722818

1/remainder is 292.63483365 so now we are at:

remainder: 0.63483365

1/remainder is 1.57521580653 and we’ll stop at the next step:

[3;7,15,1,292, 1]
remainder: 0.57521580653

Wherever you have a large number in the continued fraction, it’s because you just did one divided by a small number. The larger the integer means the smaller the remainder there was to make that integer.

The smaller a remainder is, the better approximated the number is at that step of the continued fraction.

This means that when you see a large number in a continued fraction, that if you truncated the continued fraction right before that large number, you’d have a pretty good approximation of the actual number that the continued fraction represents.

The larger the number, the better the approximation.

Because of this, looking at the continued fraction for pi, below, you can see that it has a pretty large number (292) pretty early in the sequence.


This means that the following continued fraction approximates pi “pretty well”.


We’ll cover how to figure this out further down in the post, but that fraction is actually 355/113 and has a special name “Milü”, found by Chinese mathematician and astronomer, Zǔ Chōngzhī, born 429 AD. It is within 0.000009% of the value of pi.

More info from wikipedia:

So, while pi is probably the most famous irrational number, it certain isn’t the most irrational number that there is, as it is so easily approximated by smallish integers in a division.

Quick note if you write a program to convert a floating point number to a continued fraction: If you try to make a continued fraction for a number that a floating point can’t represent exactly – such as 4.1 – you will get a very tiny remainder which then makes a very large integer when you flip it over. In my case, when using doubles, it made such a large number that converting it to an integer overflowed the integer. One way to address this is to just consider any remainder smaller than some threshold as zero. Like maybe any remainder less than 0.00001 can be considered zero.

The Continued Fraction of Phi aka The Golden Ratio

So even though pi is not the most irrational number, phi is (pronounced “fee”, and is aka the golden ratio). Yes, it is the most irrational number!

We’ll use the value 1.61803398875 and make a continued fraction.

remainder: 0.61803398875
1/remainder: 1.61803398875

remainder: 0.61803398875
1/remainder: 1.61803398875

remainder: 0.61803398875
1/remainder: 1.61803398875

Wait a second… do you see a pattern? the remainder is always the same value, and then when we do 1/remainder we always get the golden ratio again. That means this continued fraction is 1’s all the way into infinity.

In the last section we saw how having a large number in the continued fraction meant that a number was well approximated at that point in the continued fraction.

The golden ratio has a 1 for every number in the continued fraction, which is the smallest value you can have. That means it’s as least well approximated as you can be, at every stage in the continued fraction.

This is what is meant by the golden ratio being the most irrational number. It’s the least well approximated by rational numbers (dividing one integer by another).

When I said I liked irrational numbers I was mainly talking about the golden ratio, but other highly irrational numbers, like the square root of 2, are also nice. Highly irrational numbers have the properties that I like – the properties that make them useful to use as low discrepancy sequences.

An interesting thing you may have noticed above is that when you divide 1 by the golden ratio, you get the golden ratio minus 1. That is:

1 / 1.61803398875 = 0.61803398875

If you replace the golden ratio with “x” you get this equation:

1/x = x-1

If you solve that equation, you will get the golden ratio out. It is the only number that has this property! Well actually, -0.61803398875 does too, and relates to the fact that there is a + and – root solution to that quadratic equation, but maybe that’s not that surprising. 1/-0.61803398875 is -1.61803398875. The behavior is the same, it’s just happening on the other side of zero.

Something else interesting about the golden ratio is that if you square it, it’s the same as adding 1.


That gives you a formula:

x*x = x+1

If you solve that, you get the golden ratio again as the only solution (well, and the negative golden ratio minus 1 just like before). Probably not too surprising if you look at the formulas though, as you can use simple algebra to change one formula into the other.

From Continued Fractions To Regular Fractions

If you had a continued fraction and wanted to turn it into a regular fraction (or real number), you could run the process in reverse.

However, doing that means you start at the right most digit of the continued fraction and work left til you get to the beginning. How do you do this with an infinitely long continued fraction, like an irrational number would have?

Luckily there is a way to start at the left and work right, so that you can get approximations to infinitely long continued fractions (irrational numbers).

Let’s do this with pi, with this much of the continued fraction:


What you do is make a table which has a row for the numerator and denominator. The first numerator is 1, and the first denominator is 0. The second numerator is the first number in the continued fraction sequence, and the second denominator is 1.

The formula for the rest of the numerators and denominators is…

  • numerator[index] = CF[index] * numerator[index-1] + numerator[index-2]
  • denominator[index] = CF[index] * denominator[index-1] + denominator[index-2]

Doing that from left to right, you get this:

Below we do that with pi, golden ratio, sqrt(2) and an irrational number i came up with that is not very irrational, and is well approximated by 1.01.

A quick fun tangent is that you might notice that for golden ratio, both the numerators and denominators are the Fibonacci numbers. This is the link between the golden ratio and Fibonacci numbers.

Below shows the percentage error of each number, as more terms in the continued fraction are added.

Golden ratio error decreases the slowest which shows that the golden ratio is least well approximated by fractions. The made up irrational number that is approximately 1.01 has error decreasing the fastest because it is the least irrational of these numbers. Pi shows itself as having fairly low irrationality, and the square root of 2 is fairly irrational.

Here is the error on a log y axis to better show the difference in error.

It’s worth noting that 1/goldenRatio aka 0.61803398875 is every bit as irrational as the golden ratio itself is. The reason for this is that we are flipping the numerator and denominator over for this fraction that doesn’t exist. Flipping it over doesn’t make it more representable as a fraction.

When I’m using the golden ratio for low discrepancy sequences, i usually use that 1/goldenRatio value, also known as “the golden ratio conjugate” because the smaller value means it’s going to have fewer numerical issues and more precision, when working with small values (like between 0 and 1).

Also, there are an infinite amount of numbers just as irrational as the golden ratio. You can calculate them by calculating:

a' = (p*a+q) / (r*a+s)

where p,q,r,s are integers and the absolute value of p*s-q*r is 1.

That is the Moebius transform.

If you are wondering what the least irrational number is, it looks like there are multiple (infinite) and they are just numbers that are very, very, very well approximated by rationals. They are called Liouville numbers and are transcendental, meaning they can’t be calculated with polynomials basically. Proving the existence of these numbers also probed the existence of transcendental numbers themselves. (

There is also something called an irrationality measurement, but I haven’t found it that useful. It doesn’t seem to be something you can calculate with a program and compare two numbers to see which is more irrational.

Metallic Means

If all 1’s in a continued fraction make the golden ratio, what would all 2s or all 3s make? That makes the silver and bronze ratio respectively and all three of these are the first three of the “Metallic Means”. You can see the details of them below, along with pi. Wikipedia talks more about them too at

You might also be interested in reading about the “Plastic Number” which is a different way of trying to approach the goodness of the golden ratio to get another decently irrational number.

Where the golden ratio has a formula x^2 = x+1, the plastic number has a formula x^3 = x+1.

Coprime Numbers (Setup For Irrational LDS)

Coprime numbers are 2 or more integers that share only 1 as a factor.

For instance, 7 and 11 are coprime numbers. They also happen to be prime numbers, but coprime numbers don’t have to be prime. 8 and 15 are also coprime numbers although neither number is prime. You can even lump all four together and say that 7, 8, 11, 15 are coprime numbers. Combining lists of coprime numbers doesn’t usually result in a list of numbers that are still coprime. This just worked out because we added only prime numbers to the other list of coprimes, and adding primes will always keep the list of numbers coprime.

You might wonder why you should care about coprimes.

My favorite use of coprimes is when i need cheap shuffles of numbers.

For instance, let’s say you had 8 numbers and you wanted them to be shuffled into a somewhat randomized order, but it didn’t need to be a very high quality shuffle.

What you do is first pick any number that is coprime to 8… we’ll say 5. You then count an index from 1 to 8 and do this…

Out = (index * 5) % 8

Taking index from 1 to 8 gives you the following output:

5, 2, 7, 4, 1, 6, 3, 0

If you look at it a bit, you can probably find some patterns, but all the numbers are there and it is fairly mixed up wouldn’t you say? A casual observer probably isn’t going to notice any patterns if you used this in a game setting for dropping treasure or something.

Not all coprime choices give the same quality results though. Here is what happens if we use 7 as the coprime number instead of 5:

Out = (index * 7) % 8

7, 6, 5, 4, 3, 2, 1, 0

It just reversed the list! That isn’t very mixed up.

Also, let’s look at what happens if you don’t use coprimes. We’ll use 2 instead of 7.

Out = (index * 2) % 8

2, 4, 6, 0, 2, 4, 6, 0

If the numbers aren’t coprime, it won’t generate all the numbers in the list. Another way of looking at this is that the cycle length of the number sequence is maximally long when the numbers are coprime, but are not going to be as long (and will repeat) if the numbers are not coprime.

Note that you don’t have to use index values 1 through 8. You could use 0 through 7 instead or any other contiguous 8 values – like 127 through 134. The numbers in this case are all mod 8, so 127 through 134 is equivalent to 7 through 14.

Irrational LDS

An additive recurrence sequence is basically the same concept as the last section, but taken to the continuous world instead of sticking to discrete integers. You use the formula below, where A is any real number:

Out = (index * A) % 1

Just like before with coprime numbers, the value you multiply index by can make you have better or worse results.

If you pick 1/2 (0.5) for A, the sequence you’ll get out is this:
0.5, 0.0, 0.5, 0.0, 0.5, …

If you pick 1/4 (0.25) for A, you’ll get this:
0.25, 0.5, 0.75, 0.0, 0.25, 0.5, 0.75, 0.0, …

1/3 will get you this:
0.333, 0.666, 0.0, 0.333, 0.666, 0.0, …

2/3 will get you this:
0.666, 0.333, 0.0, 0.666, 0.333, 0.0, …

You can also do something more complex, like 5/8:
0.625, 0.25, 0.875, 0.5, 0.125, 0.75, 0.375, 0.0, …
(it then repeats)

If we picked 2/4, we’d end up with 0.5, just like we did when we used 1/2. When thinking about fractions to use in this setup, you should reduce the fraction, because unreduced fractions give the same value as reduced fractions. Ready for the kicker? The numerator and denominator in a reduced fraction will always be coprime integers – by definition of being a reduced fraction! Also, the denominator of the reduced fraction will tell you the length of the sequence it generates.

I snuck a little trick in the examples, did you catch it?

Let’s look at that sequence using 5/8 again, but let’s write the results as fractions.

5/8, 2/8, 7/8, 4/8, 1/8, 6/8, 3/8, 0/8, …

Now let’s look back at the sequence we made last section where we did (index*5)%8

5, 2, 7, 4, 1, 6, 3, 0

The numerator in the fraction results is the same as the integer one using coprimes. You could multiply the fractional results by 8 and get the exact same numbers. In fact take the formula we are using in this section:

Out = (index * 5/8) % 1

And multiply it by 8 to get the formula used in the last section!

Out = (index * 5) % 8

So, while we seemed to have moved away from integers into a continuous domain, we haven’t really… we are still in the land of coprime number sequences. We will always be in the land of coprime number sequences as long as we are using rational numbers, because there will always be a coprime numerator and denominator pair in the simplified fraction that defines the length of the sequence that comes out when we feed index numbers in.

So what if we ditch rational numbers and go with an irrational number? Well, using an irrational number means that the sequence will never repeat. While that in itself sounds awesome, let’s imagine we are using an irrational number that is well approximated by 1/4. That means that the sequence would never repeat, but it would NEARLY repeat every 4 values.

Let’s show the sequence using the golden ratio, where index goes from 1 to 8.
0.618034, 0.236068, 0.854102, 0.472136, 0.090170, 0.708204, 0.326238, 0.944272

Here is the same for a not very irrational number that we are going to approximate with 0.2490001.
0.249000, 0.498000, 0.747000, 0.996000, 0.245000, 0.494001, 0.743001, 0.992001

You can see that the sequence with the ~0.25 irrational isn’t EXACTLY repeating, but it sure is close to repeating. The golden ratio is making very unique numbers in comparison

So basically, if you use an irrational number that can be closely approximated by a rational number, it’ll behave a lot like that rational number and not be real great.

This is what is great about the golden ratio and other highly irrational numbers. There is no rational number that well approximates them, so not only do they not have any actual repeats, they also don’t have any NEAR repeats.

Using highly irrational numbers in this additive recurrence formula, you get a low discrepancy sequence.

Irrational LDS Offset

Every time you use an irrational LDS, you are going to get the same sequence, which can sometimes be a problem.

You can get around this by starting at a different index every time, or by adding a different starting value to the sequence.

Option #1: Randomized Start Index:

Out = ((index+StartIndex) * Irrational) % 1

Option #2: Randomized Starting Value:

Out = (Index * Irrational + StartValue) % 1

How you get this starting index or starting value is up to you. Honestly, I use white noise random numbers and it works fine, so I’ll keep doing that until I analyze it and find out how much better using an LDS is.

Anyways, if you recall from earlier in the post, this LDS won’t repeat so long as you use an irrational number. That means that the infinite number of integer index values going in means you get an infinite number of unique real numbers coming out. That means that the LDS can output any number possible just by changing the index you are at. So, offsetting by index is equivalent to offsetting by a start value, since there is some index that should give the same output as any specific starting value.

I thought that was true, but it turns out it isn’t, because integers are countably infinite while irrational numbers are a larger infinity.

It turns out that this is nearly true though, and that there always exists an index that will give you a value arbitrarily close to any specific starting value.

Although, this is only mathematically true. Computers have finite storage for integers and floats.

Anyways, it’s complicated, but either offsetting index or starting value should work just fine for getting a different sequence. I go the starting value way myself.

More info here regarding that mathematical topic:

BTW even though i show the formula for irrational LDS as index multiplying by the irrational number, that will have numerical precision problems as the index gets larger. You are way better off just adding the irrational and doing modulus when you want the value for the next index. The difference is huge, surprisingly!

Irrational LDS & Arbitrary Start Progressive Sequences

There is the concept of a progressive low discrepancy sequence, which means that any subset of the sequence, starting at index 0, will have the same properties as the entire sequence.

For instance, if you had a progressive blue noise sequence with 100 samples in it, using only the first 10 samples would be blue noise, or using the first 47 samples, or the first 73, or all 100.

If you had a non progressive blue noise sequence, the sample sequence wouldn’t be blue noise until all 100 samples have been used (or as you got close to that number).

Irrational low discrepancy sequences have a nice property that they are progressive, and they are progressive starting at ANY index, not just from index 0. This is because offsets in index are roughly equivalent to offsets in starting value, so starting in the middle of the sequence is the same as if you just started with that sample’s value as the starting value.

The golden ratio in particular is great for this. Each new sample from the golden ratio LDS falls in the biggest hole left by samples so far, which is why it’s so great for numerical integration (and similar). It has great coverage over the sampling domain. The crazy thing though, is that you can start at any index in the LDS and this is still true from that starting index, while it’s still also true if starting from index 0.

Check out the gif below to see what i’m talking about. There are 16 samples total. All 16 are shown on the left, but only the last 8 are shown on the right. I show the samples on a numberline since they are scalar values between 0 and 1, but I also show them on a circle to see that what i said about falling in the largest gap is true even with wrap around.

Here is the same setup for an animation using the square root of 2 instead.

Lastly, here is the same with pi. Pi is well approximated by 22/7 and since we are walking around a circle, the integer parts are irrelevant. Since 22/7 is 3 1/7, that makes pi act nearly like 1/7 and you can verify that it is indeed nearly repeating every 7 samples, and acting very much like 1/7. You can see how pi has samples that clump and it doesn’t give very good coverage over the sample domain for the number of samples it uses (compare the coverage to golden ratio!). This shows how a less irrational number isn’t as good at being a low discrepancy sequence.

If you have heard that plants grow leaves in a golden ratio pattern, now is probably a good time to explain why. Here is the last frame of the golden ratio gif, when there are 16 samples.

One of a plant’s goals in life is to get as much sun as it possibly can. Imagine that the sun is directly overhead and you are looking down at the plant from above. Toward this goal of maximizing sunlight, whenever light is allowed to hit the ground in the radius of the circle of the plant’s shadow, that is a lost opportunity. Sunlight is being wasted by hitting the dirt.

There may also be temporary or semi-permanent shadows that exist or come into existence at any point in the plants life, and if it put all it’s leaves such that they clump in one direction, all those leaves may become in shadow and the plant could go hungry.

A solution to these problems is to grow leaves equally distributed around the plant. This way there are no overlaps when viewing from above, and the risk of shadowing is minimized by being well spread out over all directions.

This would be the end of it if a plant was born with all of it’s leaves and it knew how many leaves it was going to need to grow – and that no leaves were going to get eaten.

Plants are dynamic life forms though, in a dynamic environment, so the leaves need to be as evenly spaced radially as possible when it’s a seedling with just a couple leaves, but also when it is much older with many leaves, possibly with some of them eaten by animals, and all that jazz.

A good way to do this would be to use the golden ratio to figure out where the grow the next leaf.

Doing this, it will have good coverage over the entire circle of possible growth, it’s ok if old leaves die off and go away since it will still be well distributed, leaves can grow larger because overlap will be minimized.

Check out the gif again and think of each line as a leaf growing around a central stem, as the plant gets taller. The circle on the right shows what happens as the first 8 leaves get old, die and fall off, being replaced by 8 new leaves.

If this talk about plants was interesting, give a read here about “Phyllotaxis”

A Connection Between Primes and Irrationals

So when we have this function where Out, A, B and index are integers:

Out = (index * A) % B

We will get a non repeating sequence of numbers B long from 0 to B-1, if A and B are coprime. Since prime numbers are coprime with all other numbers, we could also say that if A is prime, the sequence will have those properties. It isn’t required to be prime (just coprime) but let’s make the statement for primes right now.

If we move that to the real number world, where index is an integer, but A and out are real numbers:

Out = (index * A) % 1

If A is irrational, we will get a non repeating sequence of numbers infinitely long. In that way, irrational numbers are like primes in that when used, they make a non repeating sequence that is as long as possible.

More so, as A is more highly irrational, you get fewer “near repeats” as well, until getting to the golden ratio where you get a provably minimum amount of near repeats.

So, primes and irrationals (especially the golden ratio), are linked in that they make maximal length non repeating sequences when used over a field.

Co-Irrational Numbers

We talked about coprime numbers and how they would make a sequence that “went the full length”, that is, they made a sequence that didn’t repeat over the whole sequence. If you multiplied index by 5 and did a modulus by 8, you would get a sequence that was 8 items long and then would repeat.

We also talked about irrational numbers and how they would make sequences that NEVER repeated, and talked about how highly irrational numbers could make sequences that also didn’t have NEAR repeats.

This is great if you want one dimensional sequences, but what if you want two dimensional sequences or higher?

One way to do higher dimensional low discrepancy sequences is to just use a different irrational number per axis.

That is great but now you find yourself having to come up with a bunch of high quality irrational numbers since you need one for each axis. Sadly, there is only a single “golden ratio” really (Moebius transform and conjugate don’t give different behavior). If you used the golden ratio on each axis, you’d have correlation between the axes and it wouldn’t be great. Basically, each axis would have the same pattern, even if you start at a different index, or starting value.

There’s a concept I’m sure someone has thought of before, maybe going by a different name, that I call “co-irrationality”.

Coprime numbers have no shared factors other than 1, and thus “no shared cycle lengths” which make their output sequences be maximal lengths.

Similarly, co-irrational numbers should minimize how close they get to each other with their rational approximations. Where non coprime numbers used to make a sequence will repeat, two numbers that aren’t very co-irrational will make a sequence that will nearly repeat, which is nearly just as bad.

Just like how a coprime number doesn’t have to be prime, a co-irrational number doesn’t have to be very good as an irrational number in itself, and in fact, a rational number and irrational number could together be “co-irrational” to each other (like the golden ratio and 2). Also, you could have more than 2 numbers in a set that are co-irrational to each other.

The question of whether two or more numbers are coprime results in a black and white answer: a yes or a no.

The question of whether two or more numbers are co-irrational results in a grey answer. Numbers can range from not being co-irrational at all (if one is a multiple of the other), to being various levels of co-irrational, up to maximally co-irrational (again, the example of the golden ratio and 2).

So how can you tell if two numbers are co-irrational, or rather, how co-irrational they are?

Basically, divide one number by the other (doesn’t matter which is the numerator and which is the denominator) and then look at the irrationality of the resulting number. If the result is a rational number, the numbers are not co-irrational. If the result is an irrational number that is well approximated by dividing relatively small integers, the numbers are not very irrational. If the result is the golden ratio, the numbers are maximally irrational.

If you wanted to see how co-irrational 3 or more numbers were, one way could be to look at them pair-wise. You’d have to figure out how to combine the results of the multiple tests since you are testing each possible pair, but it should give you an idea of how co-irrational they were.

This might also give some thoughts on how you could come up with a set of N co-irrational numbers. Maybe gradient descent could be used to find N numbers that when looked at pairwise, gave the results closest to the golden ratio, and made the error be evenly split up between the pairs.

This section is light on evidence / experimentation / etc, but the post is already getting pretty long. It could be fun to look more deeply at co-irrationals in a separate post.


This is a really great read by Martin Roberts ( talking about generalizing the golden ratio and using that generalization for multi dimensional low discrepancy sequences, that is competitive with Sobol.

This talks about whether adding irrational numbers together or multiplying them together results in an irrational number:

This talks about how we know that e+pi is irrational or e*pi is irrational, or both, but that’s all we can say about it:

This shows a link between primes, Pascal’s triangle, and constructable polygons:

Here is an alternate take on how to measure co-irrationality:

Legend has it that a man named Hippasus was killed by Pythagorean cultists for proving the existence of irrational numbers, which destroyed their world view that all numbers were rational.

Here’s a video about that:

Here’s an article:

Here’s another article:

And another:

Numberphile has a nice video on the irrationality of phi too:

Using Low Discrepancy Sequences With Rejection Sampling

The code that generated the data for this post & implements the things talked about can be found at:

Rejection sampling lets you convert numbers from one probability distribution into numbers from a different probability distribution. It does that by throwing numbers away.

Imagine I gave you one hundred binary digits where 50% were zeros and 50% were ones. If you wanted them to be 75% zeros and 25% ones you could throw away 33 of the ones. That would leave you with 67 numbers where 50 were zeros and 17 were ones. Now, 74.6% of the numbers are zeros, and 25.4% of the numbers are ones. The transformation worked.

That change in distribution came at a cost though, the sequence got smaller.

I previously wrote up a post about rejection sampling here:
Generating Random Numbers From a Specific Distribution With Rejection Sampling

I also wrote about inverting a CDF here, which is a more complex method where you don’t throw numbers away.
Generating Random Numbers From a Specific Distribution By Inverting the CDF

When doing rejection sampling, a random number is compared against the probability for a number to survive, and the number is thrown away if it fails that test.

This post is going to look at what happens if we use low discrepancy sequences instead of random numbers (white noise) when working with rejection sampling. We are going to try substituting LDS for the random number generation to see if we should throw away a sample, and also we are going to use an LDS to generate the sequence we use as the source of rejection sampling.

Why LDS Here?

You might be asking what the motivation is for trying low discrepancy sequences here.

As a general rule, whenever I see white noise (regular random numbers) being used, it’s usually an indication that money is being left on the table and that the situation could be improved by using something else.

My heuristic there is roughly that if it’s for graphics, with sample counts that aren’t going to converge, that blue noise is going to be a good choice to make the remaining error be least noticeable, otherwise use low discrepancy sequences.

I’ve found this to be true for almost everything I’ve tried so far.

The only two exceptions I can think of at the moment are high dimensional Monte Carlo, where white noise seems to reign supreme (not my area though so not sure. I think Sobol can go pretty high dimension??), and also in random walks.

Random walks have problems with blue noise and LDS because they are so well distributed over the sampling domain, that the random walks never really leave the origin, which isn’t useful. I believe that random walks could possibly be helped by red noise and/or high discrepancy sequences (they are a thing that exist!).

I have two other more well motivated reasons for using LDS with rejection sampling though.

Firstly, if using uniform white noise random numbers to plot a histogram, it will be a flat line, matching that flat uniform PDF, but it will only do so at the limit of an infinite number of samples. In smaller number of samples, white noise is quite lumpy. Low discrepancy sequences on the other hand will make the histogram look a lot flatter with lower sample counts (arguably blue noise does a better job at very low sample counts too). LDS will match the shape of the PDF better with fewer sample counts, but will be a better match at higher sample counts too. So, working with probabilities, being able to have better statistical properties for smaller numbers of samples seems like a no brainer.

With rejection sampling specifically, if the area of your acceptance is A, and the area you are generating random numbers in is B, the probability of acceptance of a sample is A/B. White white noise, the average acceptance rate will be right after a large number of samples, but in a lower number of samples, there may be too many rejections or too little rejections which manifests as error. Using a low discrepancy sequence instead, you should always be closer to the correct acceptance rate than white noise, which also means lower error.

Going back to our situation of having 50 ones, 50 ones and throwing away 33 of the ones… a low discrepancy sequence will mean that the ones thrown away are roughly evenly spaced in the sequence. You can see how if you threw out the first 33 ones the averages would be right for the whole sequence, but that all the ones would be at the end, with none at the beginning, which is weird. White noise can cause similar things to happen, but a low discrepancy sequence will do better at making sure the ones thrown away are more evenly spaced across the whole sequence.

The second better motivated reason is this. Imagine that whenever you accepted a sample, your function emitted a 1, and when you rejected a sample, your function emitted a 0. Let’s also say that the area you are generating random numbers in is 1, which isn’t a stretch since it’s common to generate random numbers from 0 to 1 on each axis, which defines a (hyper)cube with area 1.

If you integrate this function, the result will be A/B… the acceptance probability.

If your goal was to integrate this function, using white noise, we know we’d get the usual white noise integration situation (slow convergence, high variance, etc). You’d know to use LDS in that situation to have lower error for the same sample count compared to white noise

Rejection sampling goes through the same motions as Monte Carlo integration, it just uses the output for a different purpose.

Explanation of motivation out of the way, let’s move onto the experiments!

Uniform To Linear

The first test is to use rejection sampling to convert from a uniform probability distribution to a linear probability distribution.

The linear PDF (Probability Density Function) is y=(2x+3)/4 for random numbers x being between 0 and 1. Being a PDF, that function integrates to 1 over that domain of x being from 0 to 1. For the purposes of rejection sampling, we want this function to be at most 1, instead of integrating to 1, so we are going to use the probability function y=(2x+3)/5

We want it to be at most 1 because we are essentially wrapping the function in a box that is 1×1, rolling a 2d random number to get a point in that box, and only keeping the sample if it’s underneath our function. So, the probability function needs to fit within our box by having all values be less than or equal to 1, but we also don’t want to waste space because it would cause more numbers to be thrown away than needed, so we need to make it as large as possible by making the largest value on the function be 1.

For our test, we are going to generate a number of samples in the linear distribution, by using rejection sampling on uniform distribution inputs until we have enough samples. From there we are going to break the range 0 to 1 into 10 sections and count how many numbers are in each section. That is going to give us a histogram. We are also going to subtract out the “expected” histogram value from the real PDF to show the error. We are also going to do this test 1000 times and show the average error and standard deviation.

We are going to do this for the following scenarios:

  • white/white – white noise used as the input stream, white noise used to get a random number for testing against the probability of keeping the sample.
  • white/LDS – white noise used as the input stream, but a LDS (golden ratio additive sequence) used to generate a “random number” for the probability test.
  • LDS/white – a LDS used as the input stream (square root of 2 additive sequence), white noise used for the probability test.
  • LDS/LDS – square root of 2 LDS used for the input stream, golden ratio LDS used for the probability test.

It’s also important to tell you that I’m using a (white noise) random number for the starting value of each LDS in every test. Without doing that, it would give the same results every time.

Here is the test for 100 samples generated.

Here is the test for 1000 samples generated.

Here is the test for 10000 samples generated.

Looking at the results, it shows that the clear winner is to use a low discrepancy sequence as input to your rejection sampling, while also using a low discrepancy sequence for the probability test.

Second to that is to use a low discrepancy sequence as input, while using white noise for the probability test.

Beyond that, with white noise input, it doesn’t seem to matter much if you are using a low discrepancy sequence for the probability test or not.

That was pretty surprising when I saw that. I was sure an LDS for the probability test would be useful, and it turns out it is, but we’ll see how in a little bit.

Uniform To Linear To Cubic

Let’s look at what happens if we convert a non uniform PDF to another PDF. To do that we need to generate the non uniform PDF first. An inversion method could have been used but I used rejection sampling. The type of sequence used to generate the linear PDF is the same used to convert the linear PDF to cubic.

The same tests are done as before. Here is 100 samples.

Here is 1000 samples.

Here is 10000 samples.

We got the same results as last time basically. Low discrepancy sequence as input is better than white noise as input, and if you are using LDS as input, it’s also better to use LDS for the probability test. If you are using white noise as input, it doesn’t seem to matter if you are using white noise or LDS for the probability test.

It is interesting to see though that the standard deviation (square root of variance) of LDS/LDS is noticeably higher for the higher numbers, when it was flat going from uniform to linear in the previous section. I think that area of the curve might be sensitive to problems because it’s a really likely section in the linear pdf but less likely in the cubic pdf.

As another test, here are 10000 samples of going from uniform straight to cubic to show the difference. Now there is a noticeable spike in LDS/LDS std dev in the lower numbers, which is a bit of a mystery.

As one final test, here are 10000 samples again of going from uniform to cubic, but i swapped the roles of the golden ratio and the square root of 2 LDSs. We seemingly get better results, which isn’t too surprising if you consider that the golden ratio is a better (more irrational) number than square root 2 is, and that from our previous tests, the quality of the input sequence seems to matter more than the sequence used for the probability test.

Survival Rate

We looked at the quality of the histogram coming out of the rejection sampling, but we didn’t look at how it rejects samples.

Here is the information about attempts vs samples generated from uniform to linear.

As you probably expected, LDS/LDS did best in this metric too. The main thing to pay attention to is the variance graph. Not only is the LDS/LDS variance graph lowest, it’s also flattest. This is a good thing because as we get farther to the right on the x axis, the variance of those samples sort of accumulate the variance of the samples before them too since we are looking at totals. That it’s flatter means that variance in future samples negates variance in previous samples.

Something else interesting is that in the last test where we looked at the histogram, the quality of the input mattered most to the metric we were looking at. Better quality inputs made better output histograms.

Here, the reverse is true. Better quality “random numbers” for the probability test make for better (lower) variance of sample survival.

LDS/LDS is best in both cases, but the one in 2nd place switches from LDS/white to white/LDS. 3rd and 4th place are basically tied, which is white noise and the unimportant LDS.

To give further proof & information, here is linear to cubic.

Here is uniform to cubic

Lastly, here is uniform to cubic, but swapping the roles of golden ratio and sqrt 2 LDSs.

The ripples in LDS/LDS are pretty interesting. I wonder what is causing them?

Bonus: Other 2D LDS

Lefteris Stamatogiannakis (@estama2) mentioned on twitter that it might be neat to see real 2D LDS’s used for this.

The golden ratio / sqrt(2) LDS’s seem to be pretty well suited for 2d use, but they aren’t a “designed for 2D” LDS like sobol or R2 is. (R2 is by Martin Roberts, from here:

So, I had a look! The results are mixed so I’ll let you make your own conclusions.

First, let’s look at the histogram average error / std deviation for the uniform to linear test.

Next, here is the uniform to linear to cubic test:

Here’s the uniform to cubic test. The graphs are mislabeled as uniform to linear to cubic. It’s really uniform straight to cubic. The last section goes to linear first, this one doesn’t.

Next up are the survival graphs.

First is the uiniform to linear survival data:

Next is linear to uniform to cubic:

Last is uniform to linear to cubic:

Interpolating Data Over Arbitrary Shapes With Laplace’s Equation and Walk on Spheres

A very cool paper was accepted to SIGGRAPH 2020 called “Monte Carlo Geometry Processing: A Grid-Free Approach to PDE-Based Methods on Volumetric Domains”

The paper is here:

You can find the authors on twitter at

The paper shows some really interesting things so I set out to learn more about it and arrived at the topic of this blog post.

In this blog post we are going to take common ideas of interpolation, like:

  • linearly interpolating between two points on a line
  • linearly interpolating between 3 points on a triangle using barycentric coordinates – or interpolating between N points on a simplex (including on a line, which is the last bullet point)
  • Bilinear interpolation between 4 pixels when you zoom into a texture – or N-linear interpolation between pixels for volume textures and mipmaps

…and we are going to generalize them to interpolating data over any shape, where values are specified on the boundary of the shape (not the vertices!), and interpolated through the interior of the shape. Furthermore, the shape doesn’t even need to be contiguous, or closed.

Unfortunately, unlike the interpolation methods I mentioned above, this interpolation isn’t as friendly for real time applications.

Despite the mathy nature of the topic, I’m going to try to keep it real easy to understand, so don’t get scared off!

Introduction – Laplace’s Equation / Dirichlet problem

Let’s start with the mathy bit, in a very math light way. The below describes the mathematics of our interpolation, which is a Dirichlet problem to solve Laplace’s Equation if you wanted some terminology.

Let’s talk about it’s two parts independently:

Section #1 (The Blue Box)

  • On the left, it says “The Laplacian of the function is 0…”
  • On the right, it says “…if the point is inside the shape (not on or outside the boundary)”

The Laplacian being zero inside the shape means that if you take an average of the values around any point inside the shape, they will average to the value at the center point.

If this was a 1 dimensional function, it would mean that if you pick a number on the graph, take a step to the right and the value was 1 higher than the current number, then the number if you take the same sized step to the left must be 1 lower than the current number (making sure to stay inside the boundaries). Below is a function y=3x+2 which has a Laplacian of 0.

In 1D that means the graph is a straight line (linear) but in higher dimensions, you can get more interesting shapes. Below is a 2d function z=e^x \sin(y) which has a Laplacian of 0.

You can see the graph here:*sin%28y%29

You can see that it has a Laplacian of 0 here:*sin%28y%29

I got that function from this great Khan Academy video:

So, section 1 doesn’t exactly say that our function is linear, but it does give a certain smoothness to the interior that feels a bit like a linear function.

To me, I think of how a bilinear interpolation is linear on the x and y axis, but can go quadratic as soon as you get off of those axes. Having a Laplacian of zero seems to satisfy the spirit of that “linear on an axis” property, without being limited to a specific axis.

Section #2 (The Green Box)

  • On the left it says “The function has specific values…”
  • On the right it says “…if the point is on the border”

All this means is that there are values along the border, and it doesn’t matter how those values are defined.

Putting it Together

So section 2 says there are values on the border of the shape, and section 1 says that any point inside the shape is an average of the values around it – even if that point is near the border. This defines how our interpolation works.

So, let’s take a walk. Pretend you started at a border and walked across the shape in a straight line until you hit the border at the other side of the shape.

When you started walking, you’d have the value of the border where you started. As you walked into the shape, the value would mix with values from other nearby borders (basically weighted by their distance to you).

When you hit the midway point of your walk, the border you started at and the border you ended at would have equal weighting in the current value you had, but other border values would be contributing as well.

When you got closer to the border you are ending at, the value there would start to weigh more heavily into the value, until you finally reached the destination point of your walk, and the value was just the border value there.

This essentially is our interpolation, as described by those two mathematical statements.

How To Do It – Walk On Spheres Algorithm

If you have ray marched signed distance fields, prepare to have your mind blown. There is a huge similarity here 🙂

To interpolate data in a shape, where the data values are defined by values at the boundary, you need to be able to get a distance from a point to the nearest border. You don’t need to be able to get a direction, just the distance. You need a (signed) distance field.

If you can’t get an ACTUAL distance to the border, but can get a conservative estimate (an estimate that is guaranteed to be <= the actual distance), that works fine too.

Here is a neat article from Inigo Quilez that talks about how to get that kind of an estimate using function gradients:

So the algorithm is to start at your point and…

  1. Get the distance from the point to the border. This defines a sphere representing the largest step you can take without leaving the shape. (sphere if 3d, circle if 2d, hypersphere if 4d or higher, line segment if 1d)
  2. If the distance to the border is smaller than some epsilon, return the value at the border there.
  3. Else, take a step of that distance in a uniform random direction
  4. GOTO 1

If you do that full loop 1 time, you will get a single value. This is 1 sample.

If you do that N times and average the result, you’ll get a value that approaches the correct value. This is where the “monte carlo” part mentioned in the paper comes in. We are doing monte carlo integration, where each sample is a random walk in the shape from the starting point to whatever part of the border it exits at.

Higher values of N give more accurate results with less error.

That is all there is to it!

It works because the likelihood of hitting an specific boundary point depends entirely on how far it is from your starting point. Closer boundary points are more likely to be reached, while farther boundary points are less likely to be reached.

If you are doing this process for an image, you would run this N times for each pixel individually, to get a noisy result that got less noisy as you took more samples, just like path tracing. You could even use denoising on the image before it converged.

A non obvious detail is that the shape doesn’t need to be convex. It can be concave which means that the random walk needs to “go around corners” to reach borders. The distance used for interpolation weighting is the “around the corner” distance, not a straight line distance where the straight line exits the shape and then re-enters it.

Another interesting thing is that the shape doesn’t strictly need to be closed. You could have a U shape with values along the U, and this algorithm works just fine. It just now applies to shapes inside of the bowl of the U as well as the points outside the bowl of the U, off into infinity. (Is this still interpolation? it isn’t quite extrapolation I don’t think… hrm!)

You could also have multiple shapes that aren’t connected scattered around with values on them and this algorithm still works just fine.

Being Monte Carlo integration, you can apply ideas like importance sampling, and Russian Roulette too. Importance sampling to reach boundaries more quickly. Russian Roulette to kill paths that are wandering away form useful areas in an unbiased way.

I haven’t found a way to apply low discrepancy sequences for faster convergence though, nor have I found a way to apply blue noise for better perceptual error. The search continues though and there are some resources regarding LDS in the links section.

Examples – My Shadertoys

I wrote a shadertoy which uses this technique to do 1d interpolation from red to yellow. You could do the same thing with a lerp easily, but it is a simple demonstration of the technique.

The shadertoy is at

Here it is after 1 sample. Every pixel has done one random walk on a numberline.

Here it is after 10 seconds, which is 600 samples, because of vsync locking it to 60fps.

I also made a shadertoy which works in 2D using 2D SDFs, with a handful of scenes (there is a SCENE define in the common tab to control which is being viewed) at

From that shadertoy, here is a circle which uses a sine wave around the border to make sections of white and black. This is the noisy unconverged version.

Here it is more converged.

Here is a version using rainbow colors around the border:

Here is a setup using 3 lines to make a triangle, where each line has a different color on the inside and the outside. Even though the lines are boundaries, there is an “inside and outside”, so the boundary isn’t a typical boundary.

Here is the same setup, but opening the triangle into a U shape, to show it works with shapes that aren’t closed.

Here are 3 circles and two lines. The lines are black colored on the side facing the center of the screen, and white colored on the outside.

A 3d version will be shown in the next section!

Examples – Other Shadertoys

The website for the paper on monte carlo geometry processing links to 3 really interesting shadertoys. I want to explain what they are doing real quick, because it is pretty clever stuff.

The first is by inigo quilez (@iquilezles).

This shadertoy does the same things my 2d shadertoys were doing. There are two lines that cross each other, and have a different color on each side.

The next is also by inigo.

Believe it or not, this is also doing the same thing as the last shadertoy, and my 2d shadertoy. It’s just interpolating colors between lines. Below is the lines without the interpolation/diffusion!

The last shadertoy is by Ricky Reusser (@rickyreusser).

This one is actually doing the algorithm in 3d. if you are looking at the outside of the shape, those yellow/black zebra stripes are the boundary conditions. The pickish color in the middle is the interpolated values, but they are tinted slightly red, as a variation on the plain vanilla algorithm.


So, in this blog post we talked about how to use monte carlo integration and the walk on spheres algorithm to interpolate data between very arbitrary borders in any dimension.

This is a large part of the basis of the paper on monte carlo geometry processing, but it goes a lot farther than this as well. I don’t really understand how the other things shown in the paper use WOS to do what they do. As I learn more, I may make more shadertoys and make other blog posts.


Since (signed) distance functions are really important to this technique, you might be wondering where you can go to read more about them. Luckily inigo quilez has you covered, and has been writing stuff about them for years. He talks about them in the context of ray marching, but that is an extemely similar usage case. Basically just imagine that instead of a random direction walk, you walked a specific direction down a ray, from the ray origin, to see what the ray hit. That is ray marching in a nutshell.

Here is information from him about 2d distance functions:

Here is information about 3d distance functions:

Here is another write up on this topic
“Walk on Spheres Method in Julia” from Philip Zucker (@SandMouth)

Apparently harmonic coordinates are related, and seem to be a generalization of barycentric coordinates.
“Harmonic Coordinates for Character Articulation” from Pixar

I believe these techniques can be applied to Poisson blending, where you paste pixel deltas (instead of absolute values) into a destination image, so that what you are pasting takes on the lighting and coloring of the destination image.

Poisson blending and others are talked about in this next paper. WoS can solve the Laplacian equation, but it can also solve a generalization of the Laplacian equation called the Poisson equation, so Poisson is very related.
“Poisson Image Editing”

Click to access 2003_siggraph_perez.pdf

“Walk on Spheres Method” on wikipedia

Regarding trying to apply low discrepancy sequences (or blue noise!) to the random walk on spheres, here is some interesting info about quasi random (aka LDS) random walks.

Thanks for reading!

Weighted Round Robin (Weighted Random Integers) Using The Golden Ratio Low Discrepancy Sequence

UPDATE 6/25/2020: The bottom of the post now has a section about using 2D low discrepancy sequences with an alias table and compares results vs the 1D methods. The code has been updated as well to include that stuff.

In this post we are going to talk about Round Robin first, then Weighted Round Robin.

The ~500 lines of standalone C++ that generated the data for this blog post can be found at

Round Robin

Let’s say you find yourself in one of these 2 equivalent scenarios:

  1. You want to repeatedly choose from a list of 10 objects in a fair way, where the objects are chosen at roughly the same rate – like, a loot drop table for killing monsters in a game.
  2. You have 10 queues of work that you want to process work items from in a fair way, where work is consumed from the queues at roughly the same rate.

In either case you want a stream of integers between 0 and 9 where ideally any section – large or small – should have roughly even counts of those numbers chosen. In other words, it should have a flat histogram for both large and small sections.

One way to deal with this is to just walk through the numbers 0 through 9 over and over and over. You could increment a number whenever you wanted the next index and use modulus to ensure it was between 0 and 9.

I’m calling this “Sequential” in the graphs and that would work, but depending on your usage case, it might be undesirable that it was so obviously going in order from 0 to 9.

Another way to deal with this would be to roll a random integer 0 through 9 and use that as your answer. In the code, I roll a random float between 0 and 1 and remap that to an integer 0 through 9 so it’s more like the other methods. I’m calling this “White Noise” in the graphs and that also would work in that the histogram is going to be flat after an infinite number of samples, but it won’t be very flat in small numbers of samples.

Yet another way to deal with this is to use a low discrepancy sequence (deterministic or randomized aka blue noise) such as the golden ratio additive recurrence low discrepancy sequence which looks like this:

float GetNextValue(float x)
  float newX = x + 1.61803398875f;
  return newX - floor(newX);

In the above, the first value you ever plug into this function can be any number. It can be 0.0, but it doesn’t have to be.

1.61803398875 is the golden ratio and since we are dealing with values between 0 and 1, you could also use 0.61803398875 instead which is the fractional part of the golden ratio, but is also 1 divided by the golden ratio (the golden ratio conjugate).

The golden ratio is a good choice here because it’s an irrational number, meaning that from a mathematical point of view (ignoring limitations of computers for a second) it won’t ever repeat values.

The golden ratio is also an excellent choice among irrational numbers because it is the most irrational number. Less irrational numbers – like pi – still don’t repeat but they have “near repeats” which make them less good of a choice. Then there are numbers like the square root of 2, which are pretty decently irrational – more than pi, but less than the golden ratio.

An alternate way to write the code is like this, which gives you “random access” to the sequence, but is more prone to having floating point precision issues.

float GetValue(int index, float seed)
  float newX = seed + float(index) * 1.61803398875f;
  return newX - floor(newX);

If you use this in our problem for generating floating point values between 0 and 1 and remapping to integers 0 through 9, you will get fairly even counts of occurrences of those integers (a fairly flat histogram) from both small and large sections of the sequence.

Here is a screenshot of the program generating a sequence of 80 items, using the techniques talked about so far.

It’s hard to tell what the histogram looks like from that, but it is still a bit interesting looking at the difference in patterns of the numbers chosen. There are seeming patterns even in the golden ratio because the values are quantized to 10 possible values.

Here are histograms of 100 samples. It shows that white noise is doing noticeably worse than the irrational numbers, but the irrational numbers don’t have a winner that is easy to discern with the naked eye. Sequential can’t even be seen because it is by definition a completely flat histogram and so is a flat line at y=0.1.

Here are histograms of 10,000 samples. White noise and pi look about the same, but golden ratio and square root of 2 are much closer to the actual flat histogram value / sequential sequence.

Here are histograms of 1,000,000 samples. White noise is way worse than everything else, but it’s hard to see a winner from the rest of the group.

Here we remove white noise from the graph to see the others better. Pi shows being the worst. Square root of two is next, and then the golden ratio which is closest to the sequential / flat histogram line.

Round Robin vs Random Integers

When using white noise, round robin is just random integers. It’s the same as rolling dice. So what is the difference?

In the last section I say that using a low discrepancy sequence is better than using white noise, because it reaches the target histogram more accurately with fewer samples – fewer rolls of the dice.

The downside to this though is that the numbers are no longer random… a person viewing the sequence of numbers can almost certainly guess what the next number is going to be.

So, that makes it not useful to use these for places where randomization is important – like when gambling or for cryptography.

But, where this is useful is for places where fairness matters but randomness does not.

One example could be pulling work items from a work queue. Who cares if it is possible to guess which thread is going to get assigned the next item of work? The important thing is that the work is spread evenly.

Another example is in numerical integration in monte carlo integration, and also other stochastic rendering situations. This is more what I care about. In these situations, you don’t care if the choice can be predicted or not, but if you can reduce the amount of error vs the target distribution in fewer samples that means you’ll have less noise in your resulting image, which is ::chef kiss:: oh so great. A better image for the same processing power.

If you were in a situation that wanted randomness, but you still wanted some of these benefits, you should check out “blue noise”, which is a randomized version of low discrepancy sequences. A good place to start might be my post on generating blue noise in 2 dimensions, which you could modify to only generate 1 dimensional blue noise and use for this purpose.

Weighted Round Robin

Let’s say you find yourself in the same situation as before but now there are weighted probabilities involved.

  • For the loot drop situation, different loot should be awarded with different probabilities.
  • For the work queues, you could have different powered CPUs servicing the work queues maybe. More powerful processors should be more likely to get work served to them perhaps.

For our situation, we’ll say that the weight of item 0 is 1, the weight of item 1 is 2, and so on, up to the weight of item 9 being 10.

Let’s check out our options again.

First, we could go sequentially. This means that our sequence would go like this: 011222333344444… etc.

A possible issue with this (again, depending on your specific needs) is that it is even, but it goes through every occurrence of a number before moving to the next number. If it was desired that the numbers be visited in a more interleaved fashion, you’d want to try something else.

The next thing you could try is to take the item weights and divide them by the total of the item weights, so that they add up to 1.0. This would make them normalized. From here, we could roll a random floating point number between 0 and 1, find the item that has the range that contains that floating point number, and use that as the selected item.

That would solve the problem I mentioned, but would introduce the problem of white noise: while large sample counts have the right histogram (distribution), small sample counts don’t.

We could also use irrational numbers again, which would do better at reaching the correct histogram than white noise but become deterministic like sequential, but still shuffled “a bit”.

Here’s the program output for 80 item sequences of the various techniques:

Here are histograms with 100 samples. The irrationals are all doing pretty well. White noise is not doing great. Strangely, sequential is also not doing real great for the last value because the sequence length doesn’t divide 100 evenly this time!

Here are histograms with 10,000 samples. It’s harder to see for sure who is winning by nature of the graph, compared to the last section, but we can see white noise is not doing great.

Here are histograms with 1,000,000 samples. It’s nearly impossible to see anything about it.

Here is the graph of subtracting sequential from the others. It shows that white noise is by far the worst.

Removing white noise from that graph, we can see that pi is worst, golden ratio is best, and square root of 2 isn’t far behind the golden ratio.

Old Closing (Before Alias Table Section Added)

If you look at the code, you might be saying to yourself that the weighted version isn’t great because you need a loop to convert the 1D floating point value in [0,1] to the weighted item chosen. If you switched to using an alias table, it’d be an O(1) constant time operation, but then the 1D value would not be enough – you’d need a 1d value and a biased coin flip. AKA just a 2d value that is [0,1] on each axis. It would be interesting to compare an alias table / 2D LDS sequence to see how it converged versus this method. I really, really, should have done that in this post and should do it soon.

Another option though would be to use a “branchless, fixed step count, binary search” which would also be a constant time lookup. Basically, if you had 16 items you were searching through, you can do it with four branchless operations. This is great for shaders, SIMD programming, hardware, and similar. You can read more about that technique in my post here:

Alias Tables & 2D Low Discrepancy Sequences

This section shows the effect of using 2d low discrepancy sequences with an alias table to find the weighted item in constant time O(1), instead of having to loop through the weights to find the correct item.

Using an alias table is faster, but having to use 2d low discrepancy sequences instead of 1d has the possibility to change things. The concept of evenly spaced and similar ideas gets a lot more open to interpretation when you leave the first dimension.

So, first things first, here’s a graph of 1d white noise compared with 2d white noise used to do a lookup into an alias table, for 10,000 samples. This is to show that the alias table works just like not using one, and that we are doing apples to apples comparisons. I’ve also added a “weights” to the graphs as the actual ground truth of the histogram, instead of relying on sequential which was imperfect.

I’m using three 2d low discrepancy sequences. They are:

  1. R2 – a generalization of the golden ratio to 2 dimensions. (more info here:
  2. Sobol – a very strong and well known LDS, which is often the best LDS shown in the papers i’ve seen (except when reading blue noise papers, but that’s a different usage case and quality metric hehe)
  3. Golden Ratio / Square root of 2 – I use the golden ratio sequence on the x axis and the square root of 2 sequence on the y axis.

Here are the histograms for 100 samples, comparing the 3 LDSs vs white noise and the actual weights. 1D golden ratio is also here to compare how the 2d alias method techniques compare to the 1d golden ratio sampling. It seems to be showing that white noise and Sobol are the worst outliers, and everything else is better behaved.

Here are 10,000 samples. I’m switching to error (subtracted out “weights”) so it’s easier to see small differences. Now alias table white noise is the single outlier, and everything else seems to be playing pretty nicely and all just about even.

If we remove white noise we can zoom in and see things a little better. It looks like 1D golden ratio is the best, then Sobol. It’s hard to tell, but R2 seems like it might be doing better than golden ratio / square root of 2.

Here is the same for one million samples. 1D golden ratio seems to be the winner here and the others all seem to be doing about as well as each other.

Conclusion: If you are using an alias table, you definitely should use a decent 2d low discrepancy sequence. However, it looks like you will not do as well quality wise, as using 1d golden ratio sampling without an alias table. Maybe there is a different sampling pattern to use in the alias table case that will do better though. If you know of such a thing, please share!

I wanted to show one more thing though before we go. You might notice the graphs all call R2 “additive”. If you are wondering why that is, it’s this 2D irrational low discrepancy sequence can be calculated two different ways, just like the 1D golden ratio low discrepancy sequence can. You can either multiply an index by the irrational number(s) and fract it to get the result, or you can start at index 1, add the irrational number(s) in, fract it, then go to index 2, repeating until you get to the index you want.

Doing it the first way with a multiply causes you to run into numerical issues (error due to floating point rounding, due to finite memory storage for floats), while the second has better precision. The second way is “additive” and i want to show you just how big a difference this makes by showing you the error of both kinds after 1 million samples, along with the error for white noise.

The error from r2 done via multiplication is so large that it dwarves even the error from white noise. It isn’t as noticeable at lower sample counts, because the floating point numbers are smaller, so will hit less severe rounding problems, but it still exists at lower sample counts. This isn’t unique to R2 either… this is true of any irrational additive recurrence low discrepancy sequence (including the 1D golden ratio sequence).

You probably would have better luck with doubles.

Another thing you can do is make it into a fixed point “unsigned int” representation of the irrational so that you don’t waste storage bits on features you don’t need from floats. More info on that here:


If you found this post interesting, you might like this other post, which talks about using LDS and blue noise for making loot drop tables, and how that helps the average player experience be more predictable, compared to regular old random numbers.

“Using Low Discrepancy Sequences & Blue Noise in Loot Drop Tables for Games”

There is a great video from numberphile that talks about why the golden ratio is the most irrational number. Check it out 🙂

When making the alias table, i used the stable Vose method from this page:

Casual Shadertoy Path Tracing 3: Fresnel, Rough Refraction & Absorption, Orbit Camera

Posts in this series:

  1. Basic Camera, Diffuse, Emissive
  2. Image Improvement and Glossy Reflections
  3. Fresnel, Rough Refraction & Absorption, Orbit Camera

Below is a screenshot of the shadertoy that goes with this post. Click to view full size. That shadertoy can be found at:

On the menu today is:

  • Fresnel – This makes objects shinier at grazing angles which increases realism.
  • Rough Refraction & Absorption – This makes it so we can have transparent objects, for various definitions of the term transparent.
  • Orbit Camera – This lets you control the camera with the mouse to be able to see the scene from different angles


First up is the fresnel effect. I said I wasn’t going to do it in this series, but it really adds a lot to the end result, and we are going to need it (and related parameters) for glossy reflections anyways.

Fresnel makes objects shinier when you view them at grazing angles and helps objects look more realistic. In the image below, the left sphere does not have fresnel, but the right sphere does. Fresnel adds a better sense of depth and shape.

The fresnel function we are going to use is this, so put this in common or buffer A above the raytracing shading code.

float FresnelReflectAmount(float n1, float n2, vec3 normal, vec3 incident, float f0, float f90)
        // Schlick aproximation
        float r0 = (n1-n2) / (n1+n2);
        r0 *= r0;
        float cosX = -dot(normal, incident);
        if (n1 > n2)
            float n = n1/n2;
            float sinT2 = n*n*(1.0-cosX*cosX);
            // Total internal reflection
            if (sinT2 > 1.0)
                return f90;
            cosX = sqrt(1.0-sinT2);
        float x = 1.0-cosX;
        float ret = r0+(1.0-r0)*x*x*x*x*x;

        // adjust reflect multiplier for object reflectivity
        return mix(f0, f90, ret);

n1 is the “index of refraction” or “IOR” of the material the ray started in (air, and we are going to use 1.0). n2 is the “index of refraction” of the material of the object being hit. normal is the normal of the surface where the ray hit. incident is the ray direction when it hit the object. f0 is the minimum reflection of the object (when the ray and normal are 0 degrees apart), f90 is the maximum reflection of the object (when the ray and normal are 90 degrees apart).

For the index of refraction of objects, i used a value of “1.0” in the image with the 2 orange spheres above. for f90, the maximum reflection of the objects, we are going to just use “1.0” to make objects fully reflective at the edge.

Below are spheres with a minimum reflection (reflection chance!) of 0.02, and the IOR go from 1 (on the left) to 2 (on the right). To me, the one on the right looks like a pearl. (this is the SCENE define value 2 scene in the shadertoy)

To apply fresnel to our shader, find this code:

// calculate whether we are going to do a diffuse or specular reflection ray 
float doSpecular = (RandomFloat01(rngState) < hitInfo.material.percentSpecular) ? 1.0f : 0.0f;

Change it to this:

// apply fresnel
float specularChance = hitInfo.material.percentSpecular;
if (specularChance > 0.0f)
    specularChance = FresnelReflectAmount(
        rayDir, hitInfo.normal, hitInfo.material.percentSpecular, 1.0f);  
// calculate whether we are going to do a diffuse or specular reflection ray 
float doSpecular = (RandomFloat01(rngState) < specularChance) ? 1.0f : 0.0f;

Besides that, a float for "IOR" needs to be added to SMaterialInfo too. Here is the scene from the end of last chapter, using fresnel and with everything using an IOR of 1.0.

And here is the scene from last chapter without fresnel. The difference is pretty subtle so you might want to open each image in a browser tab to flip back and forth. The biggest difference is on the edges of the yellow and pink sphere. Fresnel makes the biggest difference for objects that are a little bit shiny. Objects that are very shiny or not at all shiny won't have as noticeable fresnel effects.

Better Diffuse / Specular Selection

In the last post we used a random number tested against a percent chance for specular to choose whether a reflected ray was going to be a diffuse ray or a specular ray.

When doing that, we should have also divided the throughput by the probability of the ray actually chosen (to not make one count more than the other in the final average) but we didn’t do that. This is a “mathy detail” but easy enough to put in casually, so let’s do the right thing.

Find the place where the doSpecular float is set based on the random roll. Put this underneath that:

// get the probability for choosing the ray type we chose
float rayProbability = (doSpecular == 1.0f) ? specularChance : 1.0f - specularChance;
// avoid numerical issues causing a divide by zero, or nearly so (more important later, when we add refraction)
rayProbability = max(rayProbability, 0.001f);     

Then put this right before doing the russian roulette:

// since we chose randomly between diffuse and specular,
// we need to account for the times we didn't do one or the other.
throughput /= rayProbability;

The biggest difference from the previous image is that the pink and yellow spheres brighten up a bit, and so do their reflections, of course!

Rough Refraction & Absorption

Time for the fun! We’ll talk about the features and show the results in this section, then show how to get them into the path tracer in the next section.

Thinking about the last post for a second… in simple rendering, and old style raytracers, reflection works by using the reflect() formula/function to find the perfectly sharp mirror reflection angle for a ray hitting a surface, and traces that ray. In the last post we showed how to inject some randomness into that reflected ray, by using a material roughness value to lerp between the perfectly reflected ray and a random ray over the normal oriented hemisphere.

For refraction we are going to do much the same thing.

For simple rendering and old style raytracers, there is a refract() formula/function that find a perfectly sharp refraction angle for a ray hitting a surface, where that surface has a specific IOR (Index of refraction. Same parameter as used for fresnel). The only difference is that since refracted rays go THROUGH an object instead of reflecting off an object, the random hemisphere is going to be oriented with the negative normal of the surface.

Below is an image of refractive spheres going from IOR 1 on the left, to 1.5 on the right. Notice how the background gets distorted as the IOR increases, and also notice how the light under the spheres gets focused. Those are “caustics” and get way more interesting with other shaped meshes. (this is SCENE define value 1 in the shadertoy)

You might notice some dark streaks on the ground under the spheres in the middle. When i first saw these, i thought it was a bug, but through experimentation found out that they are actually projections of the images in the sky (the top of the skybox).

To further demonstrate this, here’s a scene where the spheres are all the same IOR but they are different distances from the floor, which focuses/defocuses the image projected through the sphere, as well as the light above the spheres. (SCENE define value 4 in the shadertoy)

Changing the skybox, it looks like this. Notice the projection on the ground at the left has that circle shape?

Looking up into at what is above the spheres you can see that circle that was projected onto the ground. This is one of the coolest things about path tracing… you get a lot of techniques “for free” that are just emergent features of the math. It’s a lot different than approaching graphics in rasterization / non path traced rendering, where every feature you want, you basically have to make explicit code for to approximate.

Going back to roughness, here are some refractive spheres using an IOR of 1.1, but varying in roughness from 0 on the left, to 0.5 on the right. Notice that the spheres themselves get obviously more rough, but also their shadow / caustics change with roughness too. (SCENE define value 6 in the shadertoy)

It’s important to notice that in these, it’s just the SURFACE of the sphere that has any roughness to it. If you were able to take one of these balls and crack it in half, the inside would be completely smooth and transparent. In a future post (next, i think!) we’ll talk about how to make some roughness inside of objects, which is also how you render smoke, fog, clouds, and also things like skin, milk, and wax.

Another fun feature you can add to transparent objects is “absorption” which means that light is absorbed over distance as it travels through the object.

We’re going to use Beer’s law to get a multiplier for light that travels through the object, to make that light decrease. The formula for that is just this:

\text{Multiplier} = e^{(-\text{absorb} \cdot \text{distance})}

For that, we can use a different absorption value per color channel so that different colors absorb quicker or slower as the light goes through the object.

Here are some spheres that have progressively more absorption. A percentage multiplier goes from 0 on the left to 1 on the right and is multiplied by (1.0, 2.0, 3.0) to get the absorption value. Notice how the spheres change but so do the caustics / shadows. This is SCENE define value 3 in the shadertoy.

You can combine roughness and absorption to make some interesting things. Here is the same spheres but with some absorption (SCENE define value 0).

Here’s the same scene from a different view, which shows how rich the shading for these objects are.

Implementing Rough Refraction & Absorption

Let’s implement this stuff!

The first thing to do is to add more fields to our SMaterialInfo struct.

We should have something roughly like this right now:

struct SMaterialInfo
    vec3 albedo;           // the color used for diffuse lighting
    vec3 emissive;         // how much the surface glows
    float percentSpecular; // percentage chance of doing specular instead of diffuse lighting
    float roughness;       // how rough the specular reflections are
    vec3 specularColor;    // the color tint of specular reflections
    float IOR;             // index of refraction. used by fresnel and refraction.

Change it to this:

struct SMaterialInfo
    // Note: diffuse chance is 1.0f - (specularChance+refractionChance)
    vec3  albedo;              // the color used for diffuse lighting
    vec3  emissive;            // how much the surface glows
    float specularChance;      // percentage chance of doing a specular reflection
    float specularRoughness;   // how rough the specular reflections are
    vec3  specularColor;       // the color tint of specular reflections
    float IOR;                 // index of refraction. used by fresnel and refraction.
    float refractionChance;    // percent chance of doing a refractive transmission
    float refractionRoughness; // how rough the refractive transmissions are
    vec3  refractionColor;     // absorption for beer's law    

You might notice that there is both a specular roughness as well as a refraction roughness. That could be combined into a single “surface roughness” which was used by both if desired. You lose the ability to make reflections have different roughness than refraction, but that wouldn’t be a huge loss.

Also, there is a specular chance, and refraction chance, with an implied diffuse chance making those sum to 1.0.

Instead of doing chances that way, another thing to try could be to make the diffuse chance be the sum of the albedo components, the specular chance be the sum of the specular color components, and the refraction chance be the sum of refraction color components. You could then divide them all by the sum of all 3 chances, so that they summed up to 1.0.

This would let you get rid of those percentages, and should be pretty good choices. The only weird thing is that the meaning of refraction color is sort of reversed compared to the others. The others are how much light to let through, but refraction color is how much light to block. You might have better luck subtracting refraction color from (1,1,1) and using that to make a refraction chance.

Since the light from rays is divided by the probability of choosing that type of ray (diffuse, specular, or refractive), it doesn’t really matter how you choose the probabilities for choosing the ray type, but when the probabilities better match the contribution to the pixel’s final color (more contribution = higher percentage), you are going to get a better (more converged) image more quickly. This is straight up importance sampling, and i’ll stop there so we can keep this casual 🙂

The next change is to address a problem we are likely to hit. Now that our material has quite a few components, it would be real easy to forget to set one when setting material properties. If you ever forget to set one, you are either going to have uninitialized values, or values coming from a previous ray hit that is no longer valid. That is going to make some real hard to track bugs, along with the possibility of undefined behavior due to uninitialized values (AFAIK). So, I made a function that initializes a material roughly to zero, but mainly just to sane values. The idea is that you call this function to init a material, and then set the specific values you care about.

SMaterialInfo GetZeroedMaterial()
    SMaterialInfo ret;
    ret.albedo = vec3(0.0f, 0.0f, 0.0f);
    ret.emissive = vec3(0.0f, 0.0f, 0.0f);
    ret.specularChance = 0.0f;
    ret.specularRoughness = 0.0f;
    ret.specularColor = vec3(0.0f, 0.0f, 0.0f);
    ret.IOR = 1.0f;
    ret.refractionChance = 0.0f;
    ret.refractionRoughness = 0.0f;
    ret.refractionColor = vec3(0.0f, 0.0f, 0.0f);
    return ret;

Every time i set material info (like, every time a ray intersection passes), i first call this function, and then set the specific values I care about. I also call it on the SRayHitInfo struct itself when i first declare it in GetColorForRay().

Next, also add a bool “fromInside” to the SRayHitInfo. We are going to need this to know when we intersect an object if we hit it from the inside, or the outside. Refraction made it so we can go inside of objects, and we need to know when that happens now.

struct SRayHitInfo
    bool fromInside;
    float dist;
    vec3 normal;
    SMaterialInfo material;

In TestQuadTrace(), where it modifies the ray hit info, make sure and set fromInside to false. We are going to say there is no inside to a quad. You could make it so the negative side of the quad (based on it’s normal) was inside, and maybe build boxes etc out of quads, but I decided not to deal with that complexity.

TestSphereTrace() already has the concept of whether the ray hit the sphere from the inside or not, but it doesn’t expose it to the caller. I just have it set the ray hit info fromInside bool based on that fromInside bool that already exists.

The various uses of roughness and percentSpecular are going to need to change to specularRoughness and specularChance.

The final shading inside the bounce for loop changes quite a bit so i’m going to just paste this below, but it is commented pretty heavily so hopefully makes sense.

The biggest thing worth explaining I think is how absorption happens. We don’t know how much absorption is going to happen when we enter an object because we don’t know how far the ray will travel through the object. Because of this, we can’t change the throughput to account for absorption when entering an object. We need to instead wait until we hit the far side of the object and then can calculate absorption and update the throughput to account for it. Another way of looking at this is that “when we hit an object from the inside, it means we should calculate and apply absorption”. This also handles the case of where a ray might bounce around inside of an object multiple times before leaving (due to specular reflection and fresnel happening INSIDE an object) – absorption would be calculated and applied at each internal specular bounce.

for (int bounceIndex = 0; bounceIndex < = c_numBounces; ++bounceIndex)
    // shoot a ray out into the world
    SRayHitInfo hitInfo;
    hitInfo.material = GetZeroedMaterial();
    hitInfo.dist = c_superFar;
    hitInfo.fromInside = false;
    TestSceneTrace(rayPos, rayDir, hitInfo);
    // if the ray missed, we are done
    if (hitInfo.dist == c_superFar)
        ret += SRGBToLinear(texture(iChannel1, rayDir).rgb) * c_skyboxBrightnessMultiplier * throughput;
    // do absorption if we are hitting from inside the object
    if (hitInfo.fromInside)
        throughput *= exp(-hitInfo.material.refractionColor * hitInfo.dist);
    // get the pre-fresnel chances
    float specularChance = hitInfo.material.specularChance;
    float refractionChance = hitInfo.material.refractionChance;
    //float diffuseChance = max(0.0f, 1.0f - (refractionChance + specularChance));
    // take fresnel into account for specularChance and adjust other chances.
    // specular takes priority.
    // chanceMultiplier makes sure we keep diffuse / refraction ratio the same.
    float rayProbability = 1.0f;
    if (specularChance > 0.0f)
        specularChance = FresnelReflectAmount(
            hitInfo.fromInside ? hitInfo.material.IOR : 1.0,
            !hitInfo.fromInside ? hitInfo.material.IOR : 1.0,
            rayDir, hitInfo.normal, hitInfo.material.specularChance, 1.0f);
        float chanceMultiplier = (1.0f - specularChance) / (1.0f - hitInfo.material.specularChance);
        refractionChance *= chanceMultiplier;
        //diffuseChance *= chanceMultiplier;
    // calculate whether we are going to do a diffuse, specular, or refractive ray
    float doSpecular = 0.0f;
    float doRefraction = 0.0f;
    float raySelectRoll = RandomFloat01(rngState);
    if (specularChance > 0.0f && raySelectRoll < specularChance)
        doSpecular = 1.0f;
        rayProbability = specularChance;
    else if (refractionChance > 0.0f && raySelectRoll < specularChance + refractionChance)
        doRefraction = 1.0f;
        rayProbability = refractionChance;
        rayProbability = 1.0f - (specularChance + refractionChance);
    // numerical problems can cause rayProbability to become small enough to cause a divide by zero.
    rayProbability = max(rayProbability, 0.001f);
    // update the ray position
    if (doRefraction == 1.0f)
        rayPos = (rayPos + rayDir * hitInfo.dist) - hitInfo.normal * c_rayPosNormalNudge;
        rayPos = (rayPos + rayDir * hitInfo.dist) + hitInfo.normal * c_rayPosNormalNudge;
    // Calculate a new ray direction.
    // Diffuse uses a normal oriented cosine weighted hemisphere sample.
    // Perfectly smooth specular uses the reflection ray.
    // Rough (glossy) specular lerps from the smooth specular to the rough diffuse by the material roughness squared
    // Squaring the roughness is just a convention to make roughness feel more linear perceptually.
    vec3 diffuseRayDir = normalize(hitInfo.normal + RandomUnitVector(rngState));
    vec3 specularRayDir = reflect(rayDir, hitInfo.normal);
    specularRayDir = normalize(mix(specularRayDir, diffuseRayDir, hitInfo.material.specularRoughness*hitInfo.material.specularRoughness));

    vec3 refractionRayDir = refract(rayDir, hitInfo.normal, hitInfo.fromInside ? hitInfo.material.IOR : 1.0f / hitInfo.material.IOR);
    refractionRayDir = normalize(mix(refractionRayDir, normalize(-hitInfo.normal + RandomUnitVector(rngState)), hitInfo.material.refractionRoughness*hitInfo.material.refractionRoughness));
    rayDir = mix(diffuseRayDir, specularRayDir, doSpecular);
    rayDir = mix(rayDir, refractionRayDir, doRefraction);
    // add in emissive lighting
    ret += hitInfo.material.emissive * throughput;
    // update the colorMultiplier. refraction doesn't alter the color until we hit the next thing, so we can do light absorption over distance.
    if (doRefraction == 0.0f)
        throughput *= mix(hitInfo.material.albedo, hitInfo.material.specularColor, doSpecular);
    // since we chose randomly between diffuse, specular, refract,
    // we need to account for the times we didn't do one or the other.
    throughput /= rayProbability;
    // Russian Roulette
    // As the throughput gets smaller, the ray is more likely to get terminated early.
    // Survivors have their value boosted to make up for fewer samples being in the average.
        float p = max(throughput.r, max(throughput.g, throughput.b));
        if (RandomFloat01(rngState) > p)

        // Add the energy we 'lose' by randomly terminating paths
        throughput *= 1.0f / p;            

Orbit Camera

Now that we can do all these cool renders, it kind of sucks that the camera is stuck in one place.

It really is a lot of fun being able to move the camera around and look at things from different angles. It’s also really helpful in debugging – like when we were able to see the circle on the ceiling of the skybox!

Luckily an orbit camera is pretty easy to add!

First drop this function in the buffer A tab, right above the mainImage function. You can leave the constants there or put them into the common tab, whichever you’d rather do.

// mouse camera control parameters
const float c_minCameraAngle = 0.01f;
const float c_maxCameraAngle = (c_pi - 0.01f);
const vec3 c_cameraAt = vec3(0.0f, 0.0f, 20.0f);
const float c_cameraDistance = 20.0f;

void GetCameraVectors(out vec3 cameraPos, out vec3 cameraFwd, out vec3 cameraUp, out vec3 cameraRight)
    // if the mouse is at (0,0) it hasn't been moved yet, so use a default camera setup
    vec2 mouse = iMouse.xy;
    if (dot(mouse, vec2(1.0f, 1.0f)) == 0.0f)
        cameraPos = vec3(0.0f, 0.0f, -c_cameraDistance);
        cameraFwd = vec3(0.0f, 0.0f, 1.0f);
        cameraUp = vec3(0.0f, 1.0f, 0.0f);
        cameraRight = vec3(1.0f, 0.0f, 0.0f);
    // otherwise use the mouse position to calculate camera position and orientation
    float angleX = -mouse.x * 16.0f / float(iResolution.x);
    float angleY = mix(c_minCameraAngle, c_maxCameraAngle, mouse.y / float(iResolution.y));
    cameraPos.x = sin(angleX) * sin(angleY) * c_cameraDistance;
    cameraPos.y = -cos(angleY) * c_cameraDistance;
    cameraPos.z = cos(angleX) * sin(angleY) * c_cameraDistance;
    cameraPos += c_cameraAt;
    cameraFwd = normalize(c_cameraAt - cameraPos);
    cameraRight = normalize(cross(vec3(0.0f, 1.0f, 0.0f), cameraFwd));
    cameraUp = normalize(cross(cameraFwd, cameraRight));   

The mainImage function also changes quite a bit between the seeding of rng and the raytracing – basically the stuff that controls jitter for TAA and calculates the camera vectors. The whole function is below, hopefully well enough commented to understand.

void mainImage( out vec4 fragColor, in vec2 fragCoord )
    // initialize a random number state based on frag coord and frame
    uint rngState = uint(uint(fragCoord.x) * uint(1973) + uint(fragCoord.y) * uint(9277) + uint(iFrame) * uint(26699)) | uint(1);    
    // calculate subpixel camera jitter for anti aliasing
    vec2 jitter = vec2(RandomFloat01(rngState), RandomFloat01(rngState)) - 0.5f;

    // get the camera vectors
    vec3 cameraPos, cameraFwd, cameraUp, cameraRight;
    GetCameraVectors(cameraPos, cameraFwd, cameraUp, cameraRight);    
    vec3 rayDir;
        // calculate a screen position from -1 to +1 on each axis
        vec2 uvJittered = (fragCoord+jitter)/iResolution.xy;
        vec2 screen = uvJittered * 2.0f - 1.0f;
        // adjust for aspect ratio
        float aspectRatio = iResolution.x / iResolution.y;
        screen.y /= aspectRatio;
        // make a ray direction based on camera orientation and field of view angle
        float cameraDistance = tan(c_FOVDegrees * 0.5f * c_pi / 180.0f);       
        rayDir = vec3(screen, cameraDistance);
        rayDir = normalize(mat3(cameraRight, cameraUp, cameraFwd) * rayDir);
    // raytrace for this pixel
    vec3 color = vec3(0.0f, 0.0f, 0.0f);
    for (int index = 0; index  0.1);
    // average the frames together
    vec4 lastFrameColor = texture(iChannel0, fragCoord / iResolution.xy);
    float blend = (iFrame  0.0 || lastFrameColor.a == 0.0f || spacePressed) ? 1.0f : 1.0f / (1.0f + (1.0f / lastFrameColor.a));
    color = mix(lastFrameColor.rgb, color, blend);

    // show the result
    fragColor = vec4(color, blend);

After you’ve done that, you can drag the left mouse in the image to move the camera around 🙂

If you follow these steps with the scene from last post, and change the central sphere to be a bit refractive, you can finally view the box from a different angle!


Before we go, I want to show some refractive spheres with IOR 1, refraction roughness 0, but fading between a purely refractive surface on the left, to a purely diffuse surface on the right. This is SCENE define value 5 in the shadertoy.

Compare this one to the purely absorption scene shown before, and also the purely roughness scene.

All of these 3 obscured what was behind them more as the effect was turned up. All of them also could have the effect partially in effect, to any level.

So, all 3 of these examples were semi transparent.

Which of these is it we are talking about when we are doing traditional alpha blending in a rasterized scene?

Well it turns out to be a bit ambiguous.

If the shadow of a semi transparent object is colored, that means absorption is the best model for it. If the shadow of a semi transparent object is just darkened, the partially diffuse surface may be the best model (could also be absorption though). If the object doesn’t cast a shadow (or barely does), it is an object made semi transparent by roughness on the surface of the object, or INSIDE the object (we’ll talk about that case later). That is especially true if the appearance (lighting) of the object changes significantly when the lighting behind it (or around it) changes, vs an object which is only lit from the front.

Kind of a strange thing thinking about alpha as all these different effects, but good to be aware of it too if doing realistic rendering.

Who knew transparency would be so opaque?

Oh PS – a color tip I got from an artist once was to never use pure primary colors (I think it was Ron Harvey @ monolith. If you are reading this Ron, hello!). This is nice from an artistic standpoint, but also, if you think about how lighting is multiplied by these various colors, whenever you make any color channel 0, it means none of that color channel will make it through that multiplication. Completely killing light of any kind feels wrong and does make things look subtly wrong (or not subtly!), because nothing in the real world is so perfect or pure that it doesn’t reflect SOME light of a different frequency than it’s main frequency or frequencies. If it were so impossibly pure, dust would fall on it and change that. Also, it’s probably a good idea to stay away from 1 for similar reasons – nothing perfectly reflects anything in the real world, there is always some imperfection – again, if it were, dust would fall on it and change it.

I think volumetrics and depth of field are next 🙂

Thanks for reading, and i’d love to see anything you make. You can find me at @Atrix256

Casual Shadertoy Path Tracing 2: Image Improvement and Glossy Reflections

Posts in this series:

  1. Basic Camera, Diffuse, Emissive
  2. Image Improvement and Glossy Reflections
  3. Fresnel, Rough Refraction & Absorption, Orbit Camera

Below is a screenshot of the shadertoy that goes with this post. Click to view full size. That shadertoy can be found at:

In this blog post we are going to improve the quality of our rendered image from last time, and also add specular / glossy reflections, to end up with the image above. Compare that to the image below, which is where we are starting from, after part 1.

On the menu today is…

  • Anti Aliasing – This makes pixelated edges be smooth, which makes for a much better render
  • sRGB – The space we do lighting calculations in are not the same as the display wants. This fixes that.
  • Tone Mapping – This converts unbounded lighting to a displayable range.
  • Exposure – This lets us brighten or darken our image before tone mapping, just like a camera’s exposure setting does.
  • Glossy Reflections – We make some objects shiny

Anti Aliasing

Adding anti aliasing to a path tracer is super easy. All we need to do is add a random number between -0.5 and +0.5 to our pixel’s location, on the x and y axis.

Since the pixel color is the average over lots of frames, what this does is give us the “average color of the pixel”. That means if the edge of a shape goes through a pixel, the pixel will average the color of the shape, and also the color of whatever is in the background. That gives us smooth anti aliased edges.

To do this, find this code:

// calculate coordinates of the ray target on the imaginary pixel plane.
// -1 to +1 on x,y axis. 1 unit away on the z axis
vec3 rayTarget = vec3((fragCoord/iResolution.xy) * 2.0f - 1.0f, cameraDistance);

and replace it with this:

// calculate subpixel camera jitter for anti aliasing
vec2 jitter = vec2(RandomFloat01(rngState), RandomFloat01(rngState)) - 0.5f;
// calculate coordinates of the ray target on the imaginary pixel plane.
// -1 to +1 on x,y axis. 1 unit away on the z axis
vec3 rayTarget = vec3(((fragCoord+jitter)/iResolution.xy) * 2.0f - 1.0f, cameraDistance);

Doing that gives us this image below. You might want to click to view it full size and compare it vs last image, to see how the edges of the spheres are smooth now. Viewing the smaller image hides the pixelation.


The colors that we send to a display are not the same as the colors that we see. This is because human vision is not linear.

With only 256 shades of color in 8 bit per color channel images, if we split the shades evenly, we’d find that there wasn’t enough dark shades because our eyes can see more detail in darker shades.

To make up for this, sRGB was made, which makes it so the values between 0 and 255 are not linearly spaced – there are more values in darker colors and less in the lighter colors.

If we do our lighting calculations in sRGB though, the math doesn’t work right because there was a non linear transform applied to our color channels.

To fix this, whenever we read a color from a texture, we should convert from sRGB to linear. Also, whenever we write out a pixel, we want to convert from linear to sRGB so it shows up correctly on the display.

We need to add these functions to our shadertoy. Make a new “common” tab and put it in there. The common tab is a header included by all other tabs, so things we put there are usable by all other tabs. In the shader that goes with this post, i put many utility functions, constants, and user controllable parameters in the common tab.

vec3 LessThan(vec3 f, float value)
    return vec3(
        (f.x < value) ? 1.0f : 0.0f,
        (f.y < value) ? 1.0f : 0.0f,
        (f.z < value) ? 1.0f : 0.0f);

vec3 LinearToSRGB(vec3 rgb)
    rgb = clamp(rgb, 0.0f, 1.0f);
    return mix(
        pow(rgb, vec3(1.0f / 2.4f)) * 1.055f - 0.055f,
        rgb * 12.92f,
        LessThan(rgb, 0.0031308f)

vec3 SRGBToLinear(vec3 rgb)
    rgb = clamp(rgb, 0.0f, 1.0f);
    return mix(
        pow(((rgb + 0.055f) / 1.055f), vec3(2.4f)),
        rgb / 12.92f,
        LessThan(rgb, 0.04045f)

ERRATA SRGBToLinear() just returned the input, in the original version of this post. That was incorrect, and correcting that changes the results you get from here on out. The only thing using the function was the skybox lookup, but that means you will get different results than are shown in the screenshots here. The problem isn’t on your end, it’s mine. Sorry about that! In fact, instead of skybox lighting being too bright as i talk about later on and having to multiply it by 0.5, it ends up being too dim and in the final shader you’ll see i multiply it by 2. Apologies and thank you for reading 🙂

There are two places we need to use this functionality. The first place is where we read the cube map when rays miss the scene geometry.

Find this code:

// if the ray missed, we are done
if (hitInfo.dist == c_superFar)
    ret += texture(iChannel1, rayDir).rgb * throughput;

Change it to this:

// if the ray missed, we are done
if (hitInfo.dist == c_superFar)
    ret += SRGBToLinear(texture(iChannel1, rayDir).rgb) * throughput;

The other place we want to change is where we write the final pixel color out.

In the “Image” tab, find this code:

vec3 color = texture(iChannel0, fragCoord / iResolution.xy).rgb;

and replace it with this:

vec3 color = texture(iChannel0, fragCoord / iResolution.xy).rgb;

// convert from linear to sRGB for display
color = LinearToSRGB(color);

Doing that, we get this image:

You might notice the balls changed color significantly. That’s because they were sRGB colors before but now they are linear colors. We can adjust their colors to make them look better though. If we change the 0.75s if their albedo to 0.5s, that gives us an image like this:

Tone Mapping & Exposure

The next thing to tackle is the fact that when we are calculating lighting, the range of light goes from 0 (perfect black) up to an unbounded amount of brightness, but our display can only display values between 0 and 1.

Tone mapping is the process of remapping the values to the 0 to 1 range. Tone mapping preserves detail in the darks while also making pixels that are too bright also have recognizable detail. We are going to use a luminance only fit of the ACES tone mapper. This tone mapping code was made by Krzysztof Narkowicz and you can read more about it on his website:

Here is the function to put in the common tab:

// ACES tone mapping curve fit to go from HDR to LDR
vec3 ACESFilm(vec3 x)
    float a = 2.51f;
    float b = 0.03f;
    float c = 2.43f;
    float d = 0.59f;
    float e = 0.14f;
    return clamp((x*(a*x + b)) / (x*(c*x + d) + e), 0.0f, 1.0f);

That function takes linear color in and gives linear color out, so we need to call it after we get the raw color, but before we convert that color to sRGB, in the image tab.

    vec3 color = texture(iChannel0, fragCoord / iResolution.xy).rgb;

    // convert unbounded HDR color range to SDR color range
    color = ACESFilm(color);

    // convert from linear to sRGB for display
    color = LinearToSRGB(color);
    fragColor = vec4(color, 1.0f);

Doing that we get this image.

The lighting on the back wall has mellowed out a bit and looks a lot more natural which is nice. It still seems a bit bright though.

To fix this we are going to do 2 things. Firstly, we are going to multiply the value we read from the skybox by 0.5 to darken the sky box a little. (i won’t paste code for that, just multiply the texture value by 0.5 after converting it to linear)

Secondly, we are going to apply EXPOSURE to the scene, by multiplying the color by a value of 0.5 before we put it through the tone mapper. That makes it so the entirety of the code in the “Image” tab is this:

void mainImage( out vec4 fragColor, in vec2 fragCoord )
    vec3 color = texture(iChannel0, fragCoord / iResolution.xy).rgb;

    // apply exposure (how long the shutter is open)
    color *= c_exposure;

    // convert unbounded HDR color range to SDR color range
    color = ACESFilm(color);

    // convert from linear to sRGB for display
    color = LinearToSRGB(color);
    fragColor = vec4(color, 1.0f);

With those two changes, our render now looks like this:

Glossy Reflections

Now for the part that everyone is waiting for: glossy reflections to make shiny things!

If you’ve done simple lighting in shaders before, you’ve probably used reflect() to calculate a light reflection ray to do some specular highlights (shiny spots) on objects.

We are going to put albedo and emissive into a SMaterialInfo struct inside of SRayHitInfo, and we are also going to add these fields:

  • percentSpecular – A float between 0 and 1. What percentage of the light that hits this object is going to be reflected specularly instead of diffusely. This is the percentage chance that a ray hitting this surface is going to choose a specular reflection instead of a diffuse one
  • roughness – A float between 0 and 1. How rough the surface is, which controls how blurry the reflection is. A value of 0 is a very sharp clean mirror like reflection, and a value of 1 is so blurry it looks just like diffuse.
  • specularColor – A vec3 which is the color of specular reflections, just like how albedo is the color of diffuse reflections. This lets you have different colors for diffuse and specular reflections, letting you have colored metal and similar.

Here is how our shading is going to work:

  • We are going to roll a random number from 0 to 1. If it’s less than percentSpecular, we are doing a specular ray, else we are doing a diffuse ray.
  • Diffuse rays use cosine weighted random hemisphere samples like before.
  • A fully rough specular ray is going to use that diffuse ray direction.
  • A zero roughness specular ray is going to use a ray direction calculated from reflect().
  • A specular ray with roughness between 0 and 1 is going to square the roughness (personal preference, you don’t have to square it), and use that value to linearly interpolate between the specular and diffuse ray directions, normalizing the result.
  • Instead of always multiplying throughput by albedo, if it’s a specular ray, we are going to instead multiply it by specularColor

That is all there is to it! Find this code:

// update the ray position
rayPos = (rayPos + rayDir * hitInfo.dist) + hitInfo.normal * c_rayPosNormalNudge;

// calculate new ray direction, in a cosine weighted hemisphere oriented at normal
rayDir = normalize(hitInfo.normal + RandomUnitVector(rngState));

// add in emissive lighting
ret += hitInfo.emissive * throughput;

// update the colorMultiplier
throughput *= hitInfo.albedo;

And replace it with this:

// update the ray position
rayPos = (rayPos + rayDir * hitInfo.dist) + hitInfo.normal * c_rayPosNormalNudge;

// calculate whether we are going to do a diffuse or specular reflection ray 
float doSpecular = (RandomFloat01(rngState) < hitInfo.material.percentSpecular) ? 1.0f : 0.0f;

// Calculate a new ray direction.
// Diffuse uses a normal oriented cosine weighted hemisphere sample.
// Perfectly smooth specular uses the reflection ray.
// Rough (glossy) specular lerps from the smooth specular to the rough diffuse by the material roughness squared
// Squaring the roughness is just a convention to make roughness feel more linear perceptually.
vec3 diffuseRayDir = normalize(hitInfo.normal + RandomUnitVector(rngState));
vec3 specularRayDir = reflect(rayDir, hitInfo.normal);
specularRayDir = normalize(mix(specularRayDir, diffuseRayDir, hitInfo.material.roughness * hitInfo.material.roughness));
rayDir = mix(diffuseRayDir, specularRayDir, doSpecular);

// add in emissive lighting
ret += hitInfo.material.emissive * throughput;

// update the colorMultiplier
throughput *= mix(hitInfo.material.albedo, hitInfo.material.specularColor, doSpecular);

If we update our scene geometry and materials a bit, we now can get the final image:

The green balls in the back have a percentSpecular of 1.0, so they are all specular. Their roughness from left to right is: 0, 0.25, 0.5, 0.75, 1.0. Their specular color is (0.3, 1.0, 0.3) which makes them have a green tint. In PBR speak, this would be a green metal (we aren't using PBR specular calculations though!).

The yellow ball in the front left has a percent specular of 0.1, a roughness of 0.2 and a specular color of (0.9f, 0.9f, 0.9f). In PBR speak, this is a dielectric.

The pinkish ball in the front center has the same except a percent specular of 0.3, so is shinnier. (Still a dielectric)

The blue and magenta ball in the front right is there to show that if you pick bad material values, you can get something that doesn't look very good. That ball has an albedo of pure blue, a percent specular of 0.5, a roughness of 0.5 and a specular color of pure red. That isn't a very realistic thing you could see in the real world – a blue object with red reflections – so it doesn't look very good in the render. Garbage in, garbage out. In PBR speak, you could get this by having a dual layer material… a blue very rough or diffuse dielectric, with a clear coat that acts like a metal? It doesn't make much sense in PBR which further shows that it isn't a very well motivated material, which is why it doesn't look very good.

If you feel like taking things a little further, our render could look better with fresnel, making glancing ray angles be more reflective. I'm not planning on doing that in this series but I do have a blog post on it here:

Bonus 1: Russian Roulette

Each ray will bounce up to 8 times, but there are times when it really doesn’t need 8 bounces to give the right look.

If we were somehow able to detect this, we could end a ray early, especially if we knew how to handle the probabilities correctly to not add bias from ending early.

Russian roulette is a method for doing just that.

We are going to find the maximum color channel value in our throughput, R, G or B, and that value will be the percentage chance that we keep going with our ray after each bounce. When we do keep going, we are going to divide our throughput by that chance of keeping the ray, which makes up for the fact that we kill some of the rays there some of the other times, by making future bounces brighter.

What this means is that as the throughput gets lower, meaning future ray bounces are likely to impact the final image less, we become more likely to stop the ray bouncing around early.

This is mainly just a performance optimization, but is good because it lets you converge faster due to being able to get more samples more quickly.

Just put this code right under where we update the throughput by multiplying it by diffuse or specular.

// Russian Roulette
// As the throughput gets smaller, the ray is more likely to get terminated early.
// Survivors have their value boosted to make up for fewer samples being in the average.
    float p = max(throughput.r, max(throughput.g, throughput.b));
    if (RandomFloat01(rngState) > p)

    // Add the energy we 'lose' by randomly terminating paths
    throughput *= 1.0f / p;

To give an idea of how this affects performance, on my machine at (i forget) resolution and (i forget) renders per frame, i was getting 30 fps which is 33.3 milliseconds per frame. After I put in this Russian roulette code, it went up to 40 fps which is 25 milliseconds per frame. Same exact visual results, but 8ms less render time per frame, or a 25% improvement in speed.

Bonus 2: Space Bar to Reset

It can be hard to get a full screen screenshot in shadertoy, because you have to click the button to make it full screen (see image below) and then it already has some of the small rendering still there on the screen.

We can actually make it so when you press the space bar, that it resets the render.

How you do this is instead of using iFrame to figure out how much to blend this frame into the next, you store the blend (1/framesRendered) in the alpha channel of each pixel. You can then assign the keyboard as a texture to iChannel2 (it’s under the Misc tab) and then read that “keyboard texture” to see if the space key was pressed.

Find this code:

// average the frames together
vec3 lastFrameColor = texture(iChannel0, fragCoord / iResolution.xy).rgb;
color = mix(lastFrameColor, color, 1.0f / float(iFrame+1));

// show the result
fragColor = vec4(color, 1.0f);

And replace it with this (after assigning the keyboard as a texture in the iChannel2 slot!):

// see if space was pressed. if so we want to restart our render.
// This is useful for when we go fullscreen for a bigger image.
bool spacePressed = (texture(iChannel2, vec2(32.5/256.0,0.25)).x > 0.1);

// average the frames together
vec4 lastFrameColor = texture(iChannel0, fragCoord / iResolution.xy);
float blend = (lastFrameColor.a == 0.0f || spacePressed) ? 1.0f : 1.0f / (1.0f + (1.0f / lastFrameColor.a));
color = mix(lastFrameColor.rgb, color, blend);

// show the result
fragColor = vec4(color, blend);

Now, you can flip to full screen and press space bar to reset the render to get a better larger render.


Don’t forget… if you are trying to make a render and you are seeing 60fps, that means you probably could be converging faster and you are waiting for no good reason. Try turning c_numRendersPerFrame up until you aren’t at 60fps anymore. Anything lower than 60 means it’s working near capacity, getting it down to 10 or 5 or lower fps isn’t going to help you, so don’t worry about bringing your shadertoy tab to a crawl, it doesn’t do any good.

Be on the lookout for the next post in this series. We have some more topics to casually implement, including: depth of field and bokeh, camera movement, participating media (fog), bloom, refraction & transparencies, and more.

Thanks for reading and share any cool things you make with me on twitter at @Atrix256

A Link Between Russian Roulette and Rejection Sampling / Importance Sampling

This post explains how Russian Roulette can be viewed as rejection sampling a PDF to do importance sampling.

The explanation sticks to simple 1D functions and has some ~500 lines of c++ to accompany it that made the data for the explanations.

The code is at:

Background Topics

Here are some quick definitions and links to more info.

Monte Carlo Integration – Using random samples of a function to find the area under the curve if you graphed it. Aka using random numbers to integrate a function numerically.

Importance Sampling – Using random numbers in Monte Carlo integration where the probabilities of numbers drawn aren’t the same for each number. As the graph of the probabilities looks more similar to the graph of the function being sampled, Monte Carlo integration gets closer to the right answer a lot more quickly. At the extreme, it only takes one sample to get the exact answer!

PDF – Probability density function. This is a function that describes how probable different numbers are for being generated. It integrates to 1, but might have values that go above one. The area under the curve between 2 values is the percent chance probability of a random number falling between those two values.

Uniform Distribution – Random numbers (A PDF) where all possible values are equally likely. If you made a histogram of the numbers generated from this distribution, it would be flat, if you generated an infinite number of values.

Rejection Sampling – If you know the shape of a PDF you would like to draw random numbers from, but you aren’t sure how to actually generate random numbers following that PDF, rejection sampling is one way to do that.

Russian Roulette – If you have something like an infinitely recursive (or iterative) integral, like you find with the rendering equation, Russian Roulette is a way to stop the infinite process early while still getting the correct answer. It stops early without adding bias. More generally it lets you skip doing some of the work you have to do, without adding bias.

Here are the links:

Monte Carlo Integration Explanation in 1D

Generating Random Numbers From a Specific Distribution With Rejection Sampling

Russian Roulette and Splitting (PBRT book chapter 13.7)

You’ll also see the code for this post doing a “lerp” to average sample points. If you want more info about that, check this out:

Incremental Averaging

Integrating y=x^2 from 0 to 1

Let’s look at a couple different ways to integrate a function numerically, to get a sense of things.

We’re going to integrate y=x^2 from 0 to 1 three different ways. First we’ll use regular Monte Carlo with a uniform distribution, second we’ll importance sample it using rejection sampling, and thirdly, we’ll use Russian roulette which as you’ll see is also importance sampling using rejection sampling.

Simple Monte Carlo integration works like this:

  1. Take N random samples of a function between A and B
  2. Sum them up and divide by N to get the average height of the function between A and B
  3. Multiply by B-A to get the area under the curve

Doing that with the accompanying code, after 1000 samples the value it comes up with is 0.298374. The actual value is 1/3 or 0.33333… so it got fairly close. Check out the “// monte carlo” section in the code inside of TestFlat().

We’ll do the same steps again but this time do importance sampling with rejection sampling.

Since we know y=x^2 is 0 when x is 0 and 1 where x is 1, we can use rejection sampling to make a probability density function that is similar.

Whenever we generate a random number “x” between 0 and 1 to use in the function, we’ll say there is an “x” percent chance that we keep that number, otherwise we throw it away and generate a new random number. Another way of saying this is that there is a 1-x percent chance that we are going to throw it away and generate a new random number.

That probability of keeping numbers looks like the function y=x when you do this. y=x isn’t a proper PDF though because the area under the curve is 0.5, not 1, so you need to multiply it by 2 to make it a proper PDF. This doesn’t actually change the probabilities, this is just making a “proper PDF” that describes the probabilities. So, the PDF for these random numbers is y=2x.

We have to change how we use these random numbers too though. After we calculate our y value by plugging x into y=x*x, we need to divide it by the PDF value for x before we add them to the sum (the sum which we later divide by N to get the average). You just plug x into the PDF function y=2x to get the PDF value. We divide by the PDF because what this does is make it so numbers that come up more often weigh less into the final average. For instance, if one number came up twice as often than another in your random numbers, you should make it count for half as much to the final average, to make things equal like they are in plain vanilla Monte Carlo.

When we do this with 1000 samples, we get a value of 0.334112 (Monte Carlo was 0.298374). You might find it interesting that we had to generate 2052 potential x values for the 1000 we actually used. 1052 were thrown away.

Next up we are going to do Russian Roulette. How this works is we are going to have a probability where we just aren’t going to take a sample AT ALL. We are going to modify the first step of the Monte Carlo algorithm:

  1. Take N random samples of a function between A and B. There is some percent chance “p” we won’t actually do a sample, but will just use a value of 0. If we do take a sample, we are going to divide the value we got by 1-p which will make it a larger value to make up for the times we didn’t take a sample.
  2. Sum them up and divide by N to get the average height of the function between A and B
  3. Multiply by B-A to get the area under the curve

If we do that with a probability of 1-x of killing a sample (aka an “x” percent probability of using a sample. Is this sounding familiar?), and do 1000 attempts, we get a value of 0.327171. We didn’t do as well as we did with importance sampling using rejection sampling, but that used a full 1000 samples, while this method only ended up using 746 samples actual samples.


In the rejection sampling algorithm, we used x as the percent probability to determine whether we actually plug that x into the function, or a 1-x percent probability to generate a different x value. We loop until success.

In the Russian roulette algorithm, we used 1-x as the percent probability to use 0 instead of taking a sample, or x as the probability to take a sample and divide it by (1-x) to make it larger to make up for samples we killed.

Does those sound pretty similar to you?

As further evidence, here’s the histogram of x values generated by each of those algorithms, along with a graph of the PDF y=2x, modifying the code to take 100k samples instead of 1000, to reduce the random noise in the graph (make it converge more). This shows that they are generating random numbers from the same PDF, which is y=2x.

Here is the absolute error of each technique for each sample. Rejection sampling a PDF of y=2x is showing lowest error, followed by Russian Roulette with a probability of 1-x to kill a sample, and lastly in the rear comes vanilla Monte Carlo. It’s important to remember that rejection sampling and monte carlo are taking new samples each time you take a step forward on the x axis, but for russian roulette, it’s sometimes not taking a sample and just using a value of 0 instead.

Looking at the standard deviation, the square root of variance, this tells us how erratic the samples are.

Unfortunately, the worst variance comes from the Russian Roulette algorithm. I don’t have a good sense of why this is or how it can have worse variance than Monte Carlo but come up with a better result. Mathematically, I know the reason is due to the zeroes used when a sample is killed, but intuitively, I don’t grok how this can be true or what it means exactly to be better converged, but have worse variance. Something to chew on I guess (or feel free to educate me if you know!)

Why Does Russian Roulette Boost Values?

Ok so let’s take a quick break to get some intuition for why Russian roulette boosts values up when they survive the random killing.

Let’s say we have a function y=1, so it’s just a function that is 1 everywhere.

If we wanted to integrate that from 0 to 1 with Monte Carlo using four uniform random samples, we’d get four 1s, sum them up to get 4, divide by the number of samples (4) and get our result of 1.

That’s the right answer: a function that has a y value of 1 everywhere, for x being between 0 and 1 is a square of length 1 on both the x and y axis. That square has an area of 1.

If we wanted to use Russian roulette here with a 25% chance to kill a sample, we could assume that one of our four samples was killed. That means we get 3 samples each getting a value of 1, and the sample we killed has a value of 0.

If we sum the samples we’d get 3. If we divide them by the number of samples taken, we’d get 3/4 or 0.75.

This is the incorrect answer!

Since we know we are going to kill one out of every four samples, what we could do is boost every survivor to make up for the samples we killed.

The algorithm says we divide by 1 minus the probability, which is in this case 0.75 or 3/4.

Dividing by 3/4 is the same as multiplying by 4/3.

So, going back to our samples we have four samples. Three of them were 1, multiplied by 4/3 to get 4/3. The last sample was killed so is 0.

Summing these samples up, we get 12/3 which reduces to 4.

Dividing that sum by the total number of samples (4) we get 1.

That is the RIGHT answer.

Russian roulette boosts the survivors to make up for the missing information of the samples that were killed.

Sure, it’s random, and the boost amount may not actually match the number of samples actually killed, but given enough samples it’ll be true, or close enough to true.

Bonus: use blue noise or low discrepancy sequences to make this be more true for smaller sample counts.

Iterative / Recursive Functions

You might be saying to yourself that this example is stupid, because nobody uses Russian Roulette this way, and you may in fact be doubting that it is Russian Roulette because it’s used in such a foreign setting.

The usual place you see it used if you are a graphics person is in terminating a ray while it’s bouncing around a scene so that you can kill it early without bias, when the ray isn’t likely to be contributing much value to the result if it continues.

Well, same thing applies. Using it in this setting just lets you say “ah screw it, i’m stopping here” with some probability. The rest of the time, the values you DO get are boosted up to make up for the slackers who quit early.

The only real difference is that because it’s recursive, each survival grows the weight of not only the current iteration but all future iterations too – this is because the current iteration relies also on the future iterations.

Let’s quickly see some data for a recursive/iterative function:

  1. calculate y = f(x) = sin(x*pi/2)
  2. Divide x by 2 and do it again
  3. Do this 1000 times
  4. The sum of the above is the result

We are going to integrate that from 0 to 1, meaning we are going to take random numbers between 0 and 1, plug them into the above, and the average will be the area under the curve.

After doing this 10,000 times…

Monte Carlo gives us a value of 1.403612 with a standard deviation of 0.733499.

Since the basic function sin(x*pi/2) is 0 when x is 0 and 1 when x is 1, i used the same probabilities for Russian Roulette as before. That gave me a value of 1.285901, with a standard deviation of 3.493362. Instead of doing a full 1000 loops through the function though, for those 10000, it did 0 loops at minimum, 5 at max, and 0.71 on average. That is quite a savings on iteration count.

To be honest, i’m not sure what the right answer to this function is, so i can’t say how correct these answers are (A less lazy person would run Monte Carlo with a very large number of samples). They are pretty close to each other though which makes me think they must be in the ballpark. If that’s true, they have fairly similar error (at least in a spherical cow sense), but one did 1000 iterations, while the other did 0 or 1 on average, and 5 maximum. That is pretty cool! Admittedly the variance got worse again though.

Using a different probability for Russian Roulette, where the PDF more closely matched what the actual shape of the function was (the iterated function, not just a single incarnation of it?) would get you a better answer quicker. i didn’t really experiment with that very much.

I did think that 0 to 1 iterations on average was pretty low though, so instead of making the probability be 1-x to kill a sample, i tried making it (1-x)/5. That way, the probability shape is still the same, but samples are less likely to be killed, so they should survive more iterations.

That did in fact work!

After doing that 10,000 times, Russian Roulette gave a value of 1.378251 with a standard deviation of 0.939193. The minimum number of iterations is still zero, but the maximum went up to 45, and the average went up to 4.89. We got to keep the shape of our PDF, but got deeper through the iterations. So, we are sort of importance sampling on 2 axes – fun 😛

The value changed and I’m unsure if it’s for better or worse, but it got closer to the plain Monte Carlo answer, which is a good sign. The variance also got better which is neat, but is still worse than the Monte Carlo variance.

Here’s a graph of the value of the functions as the number of samples grow:

Here’s a graph of the standard deviation. The probability 1-x has really bad variance, while dividing it by 5 has much better variance.


So hopefully you now can see a link between Russian Roulette and importance sampling using rejection sampling.

What I found neat about this perspective is that instead of Russian Roulette being some “one off” technique, I can merge it mentally with rejection sampling and importance sampling, and free up some brain space.

I also like this “off brand” of Russian Roulette.

Basically, rejection sampling is a “no go” for real time graphics in general, like for use in a shader. The reason for this is that rejection sampling takes an unknown number of iterations each time to generate a single valid sample.

If you instead use Russian Roulette, it’s a fixed upper bound amount of work, but you will get some randomly less number of samples of your data.

Another way to put it is that when generating numbers from a PDF…

  • Rejection sampling requires an unknown sized loop to generate a fixed number of samples
  • Russian Roulette has a fixed loop to generate an unknown number of samples

Engineering is all about trade offs, so having this trade off is pretty handy IMO.

Let’s do a quick, simple recap on Rejection Sampling vs Russian Roulette for integrating a y=f(x) function using a specific PDF.

I’m defining a function A(x) which is the probability of accepting a value “x” chosen. The function A is just a scaled version of the PDF such that all values in the PDF are <= 1. This function gives you the probability of keeping a value "x" after randomly selecting it – for both rejection sampling, and russian roulette.

Rejection Sampling

  1. Draw a number x from a uniform distribution between a and b.
  2. Roll a random number from 0 to 1. If that number is greater than A(x), go to 1.
  3. calculate y=f(x)/pdf(x) to get a sample
  4. The result of the integral is the average y value, multiplied by (b-a)

The count of random numbers you had to generate is dynamic, to get a fixed number of samples.

The number of random x’s generated is >= the number of samples you got out.

The count on the left changes.

Russian Roulette

  1. Draw a number x from a uniform distribution between a and b.
  2. Roll a random number from 0 to 1. If that number is greater than A(x), y=0
  3. Else, y=f(x)/A(x)
  4. The result of the integral is the average y value, multiplied by (b-a)

The count of random samples you had to generate is fixed, to get a dynamic number of samples.

The number of random x’s generated is still >= the number of samples you got out.

The count on the right changes.

Last thing: I could be wrong, but i believe that Rejection Sampling is a “Las Vegas” algorithm, while Russian Roulette is a “Monte Carlo” algorithm.

Assuming i’m correct in my understanding there, Monte Carlo algorithms are preferred in real time graphics, games, and other real time situations. The ability to have knowledge or control over how long something will take to execute is in general way more important than the quality of the result you get. In modern times of temporal accumulation, this is even more true, since poor data every now and then can be soaked up by the temporal accumulation, to an extent.

And in those situations, you can do even better by using random or quasi random sequences which have better properties over both space and time, to help ensure that you get better results over smaller regions of space, and over shorter periods of time.