When I see material on Markov chains, it usually comes in two flavors:

- Very Mathy
- Pretty impressive results light on explanation

It turns out the reason for this is that they CAN be very mathy, but they can also be extremely simple.

Without knowing this, I decided it was time to learn about Markov chains. I leveled up my linear algebra knowledge a bit, finally getting a solid grasp on eigenvectors, and learning things like how to put a matrix into eigenbasis form to make matrix exponentiation a trivial operation. There are links at the bottom of the post if you want to learn this stuff too.

Then, I sat down to learn Markov chains and nearly flipped my table over! Yes, Markov chains can be mathy (and matrix exponentiation is one way to find a Markov chain steady state, but not the best), but that stuff isn’t really required for most uses.

A Markov chain is just any situation where you have some number of states, and each state has percentage chances to change to 0 or more other states.

You can get these percentages by looking at actual data, and then you can use these probabilities to GENERATE data of similar types / styles.

This post uses Markov chains to generate text in the style of provided source text.

The first step is to analyze the source text.

To analyze the source text, the code goes through the text and, for each word it finds, keeps track of which words came next and how many times each of them came next.
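Here is a minimal sketch of that counting step. It is not the code from the repo that goes with this post; the names `NextWordCounts` and `CountNextWords` are made up for illustration, and it assumes the text has already been split into a list of words.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Maps a word to the words seen right after it, with how often each was seen.
// (Hypothetical type name, chosen for this sketch.)
using NextWordCounts = std::map<std::string, std::map<std::string, int>>;

// Walk the word list once and count every (word, next word) pair.
NextWordCounts CountNextWords(const std::vector<std::string>& words)
{
    NextWordCounts counts;
    for (std::size_t i = 0; i + 1 < words.size(); ++i)
        counts[words[i]][words[i + 1]]++;
    return counts;
}
```

The last word in the text has nothing after it, so it only shows up as a "next word", never as a key of its own (unless it also appears earlier in the text).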

When analyzing the story “The Tell-Tale Heart” by Edgar Allan Poe for instance (https://poestories.com/read/telltaleheart, also included as data/telltale.txt in the code that goes with this post), here are the words that came after “when” and their counts.

- all – 1
- enveloped – 2
- he – 1
- i – 4
- my – 2
- overcharged – 1
- the – 1

Here are the counts for the words that appear after “is”:

- but – 1
- impossible – 1
- merely – 1
- nothing – 1
- only – 1
- the – 2

After all these counts have been gathered up, the next step is to convert them into probabilities. You do this by summing the counts of all the words that come after a specific word, and dividing each word’s count by that total.

The above examples then turn from counts to probabilities (the percentages are rounded down). Here is “when”:

- all – 8%
- enveloped – 16%
- he – 8%
- i – 33%
- my – 16%
- overcharged – 8%
- the – 8%

Here is “is”:

- but – 14%
- impossible – 14%
- merely – 14%
- nothing – 14%
- only – 14%
- the – 28%
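The count-to-probability step can be sketched like this. Again, this is an illustration and not the repo’s code; `CountsToProbabilities` is a name I made up, and it works on the followers of a single word.

```cpp
#include <cassert>
#include <map>
#include <string>

// Convert the "next word" counts for one word into probabilities that sum to 1.
std::map<std::string, double> CountsToProbabilities(const std::map<std::string, int>& counts)
{
    // Total number of times any word followed the word in question.
    int total = 0;
    for (const auto& entry : counts)
        total += entry.second;
    assert(total > 0 && "need at least one observed next word");

    // Each probability is that word's count divided by the total.
    std::map<std::string, double> probabilities;
    for (const auto& entry : counts)
        probabilities[entry.first] = double(entry.second) / double(total);
    return probabilities;
}
```

Feeding it the counts for “is” from above gives 2/7 for “the” and 1/7 for each of the others, which are the 28% and 14% shown in the list.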

Note: The code that goes with this post spits out these counts and percentages in the “out/stats.txt” file if you ever want to see the data.

Once the probabilities are known, you can start generating text. The first thing you do is pick a word purely at random; this is the first word in the text.

Next, you use the probabilities of what words come after that word to randomly choose the next word.

You then use the probabilities of what words come after that word to randomly choose the next word.

This repeats until you’ve generated as much text as you want.
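As a rough sketch, the generation loop could look like the code below. The function names are hypothetical and this is not the repo’s implementation. It assumes the first word is picked uniformly from the words that have at least one known follower, and it samples straight from the raw counts, since `std::discrete_distribution` normalizes its weights for you.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <random>
#include <string>
#include <vector>

using NextWordCounts = std::map<std::string, std::map<std::string, int>>;

// Choose a follower of `current`, weighted by how often it followed in the source.
std::string PickNextWord(const NextWordCounts& counts, const std::string& current, std::mt19937& rng)
{
    auto it = counts.find(current);
    assert(it != counts.end() && !it->second.empty());

    std::vector<std::string> words;
    std::vector<double> weights;
    for (const auto& entry : it->second)
    {
        words.push_back(entry.first);
        weights.push_back(double(entry.second));
    }

    // The weights are normalized internally, so raw counts work directly.
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return words[std::size_t(pick(rng))];
}

// Generate up to wordCount words. Stops early if a word has no known follower.
std::vector<std::string> GenerateWords(const NextWordCounts& counts, std::size_t wordCount, std::mt19937& rng)
{
    assert(!counts.empty());
    std::vector<std::string> output;
    if (wordCount == 0)
        return output;

    // First word: uniformly random among the words that have a follower.
    std::uniform_int_distribution<std::size_t> first(0, counts.size() - 1);
    output.push_back(std::next(counts.begin(), std::ptrdiff_t(first(rng)))->first);

    while (output.size() < wordCount && counts.count(output.back()) != 0)
        output.push_back(PickNextWord(counts, output.back(), rng));
    return output;
}
```

Seeding the `std::mt19937` with a fixed value makes the output repeatable, which is handy when tuning.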

The code with this post generates 1000 words into the “out/generated.txt” file.

That is literally all there is to it. You could do this same process with sheet music to generate more music in the same style, you could do it with weather forecasts to generate realistic weather forecasts (or even try to use it to predict what weather is next). You can do this with any data you can imagine.

Here are 100 words of generated text from each of several sources.

First is text generated from “The Tell-Tale Heart” by Edgar Allan Poe (https://poestories.com/read/telltaleheart):

…About trifles, and with perfect distinctness — very slowly, my sagacity. I then took me, louder — you cannot imagine how stealthily — with what caution — cautiously — would have told you may think that no longer i knew that no blood – spot. He would not even his room, to do the hour had made up my whole week before him. I knew what dissimulation i showed them causeless, undisturbed. Now a hideous heart, no — wide open — all and the old man, and he would have…

Here is text generated from “The Last Question” by Isaac Asimov (http://hell.pl/szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm):

…Glory that. Man said, it into a meaningful answer. Granted, said, might be kept from the entire known to restore the universe for meaningful answer. Mq – talkie robot, ac learned how many stars are dying. The boys appreciated that not. Cosmic ac that, how may be able to reach the small station, said at half the same. He shrugged. We’ll have enough to be alone. And lose itself aloof. When any other kind of universal ac. He consisted of individuals were self – contact…

Here is text generated from a research paper “Projective Blue-Noise Sampling” (http://resources.mpi-inf.mpg.de/ProjectiveBlueNoise/ProjectiveBlueNoise.pdf):

…Numerical integration. Mj patterns to vector multiplication to achieve a way that the above question whether there exist distributions have addressed anisotropic classic lloyd relaxation green and rotated pattern significantly worse than the j 1, where each site: our projective blue – noise point distributions along both axes. Previous work sampling when undergoing one after a certain number of common blue noise patterns, but at the publisher s ., cohen – left constructs a quality of latinizing the non – sample counts however, as a set only in a theory 28, this shrinkage…

Here is text generated from an example (not real, but representative) psych report from my wife who is a school psychologist:

…Brother had to mildly impaired body movement, the school and placement after a 90 probability that student: adapting to struggle as video games. Student’s planning and he request, spelling subtest scores. This time. The student: this time and accurately with both, including morphology, 2013. Administrators should consider participation in the following are student as intellectually disabled specific auditory comprehension of reading: mr. Mrs. The two subtest is designed to use of or economic disadvantages, gestures, vitality or economic disadvantages, picking at approximately 5th grade prior…

Here we generate a Markov chain using ALL the above source texts, to get a mashup of all of them.

…Restore the sphere packing radius is likely an adaptive skills. Please see inset in the conner s problems, we’ll just have well and visualization and he is computed on 1 2 was contacted by things, and restricted number of his abilities. We can simply like them, as well as a s difficulty interacting with a closer to cry, the process based on the standards – appropriate to spurious aliasing artefacts mit87, making a meaningful answer. Finally, 11 months through hyperspace to try his eye contact. Jerrodine’s eyes were going out if…

Lastly, here is only Poe and Asimov combined:

…Could not forever, and continually increased. And stood for a sudden springing to get back and the eighth night i to that man, 2061, but the original star and made trips. A very, and fell full youthfulness even to feel — i then stop someday in five words on a while i heard all the noise steadily for us, calling him to pluto and now a galaxy alone pours out, quick sound would think of individuals. He stirred his hideous veil over the ceiling. Twenty billion years ago, man, …

Using one word to generate the next word works somewhat well – the generated Poe text definitely seemed like Poe for instance – but there are plenty of times when things don’t make much sense.

A Markov chain becomes higher order when you don’t just look at the current state to transition to the next state, but look at the last N states.

In the text generation case, a 2nd order Markov chain looks at the previous 2 words to choose the next word. An order 3 Markov chain looks at the previous 3 words.

Interestingly, an order 0 Markov chain looks at NO WORDS to generate the next word, so is purely random word generation, with similar word counts (by percentage) as the original text.
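One way to sketch an order N chain is to keep the same table but make the key the last N words joined together. This is an assumption about representation for illustration, not how the repo does it, and `MakeState` and `CountNextWords` are hypothetical names. Note how order 0 falls out naturally: every word is filed under the same empty key, so you just get overall word frequencies.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using NextWordCounts = std::map<std::string, std::map<std::string, int>>;

// Join `order` consecutive words starting at `start` into a single state key.
std::string MakeState(const std::vector<std::string>& words, std::size_t start, std::size_t order)
{
    assert(start + order <= words.size());
    std::string state;
    for (std::size_t i = 0; i < order; ++i)
    {
        if (i > 0)
            state += ' ';
        state += words[start + i];
    }
    return state;
}

// Count which word follows each run of `order` words.
// With order 0 the state is always "", giving plain word frequencies.
NextWordCounts CountNextWords(const std::vector<std::string>& words, std::size_t order)
{
    NextWordCounts counts;
    for (std::size_t i = 0; i + order < words.size(); ++i)
        counts[MakeState(words, i, order)][words[i + order]]++;
    return counts;
}
```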

The code that goes along with this post lets you specify the order of the Markov chain.

Here is “The Tell-Tale Heart” with an order 2 Markov chain.

…Dark as midnight. As the bell sounded the hour, there came to my ears: but he had been too wary for that. A tub had caught all — ha ha when i describe the wise precautions i took for the concealment of the old man sprang up in bed, crying out — no blood – spot whatever. I removed the bed and examined the corpse. Yes, he was stone, stone dead. I knew that he had been lodged at the police. A watch’s minute hand moves more quickly than did…

If you compare that to the actual story, you can find fairly large sections that are taken verbatim from the source text, but the arrangement of those larger chunks is different.

The reason for this is that when you have two words mapping to the next word, the number of distinct states goes up, so on average each state has fewer choices for “next words”. That makes the results less random and more deterministic.

If you gave it more text (like, maybe, all of Edgar Allan Poe’s work), there would be more options for the next word after specific 2 word pairs, but with a single short story, it doesn’t have very many choices. If you look at the out/stats.txt file and compare order 1 vs order 2, you can see that order 2 has a lot more situations where a current state maps to a single next state.

At order 3 there are even fewer choices, and it hits a pattern loop:

…Had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to fear there entered three men, who introduced themselves, with perfect suavity, as officers of the police. A shriek had been heard by a neighbor during the night; suspicion of foul play had been aroused; information had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to…

Here is an order 2 mashup of Poe and Asimov:

…Crossing the floor, and still chatted. The universal ac interrupted zee prime’s own. It had to be contrary, and jerrodette i. Ask multivac. As the passage through hyperspace was completed in its place, each cared for by perfect automatons, equally incorruptible, each with its dreadful echo, the real essence of men was to be contrary now, now, honeys. I’ll ask microvac. Don’t shout. When the sun, and their only concern at the visiplate change as the frightened technicians felt they could hold their breath no…

Lastly, here’s an order 2 mashup of all 4 source texts:

…Mathematics: student does not require special education and related services, the radius of each other, indistinguishable. Man said, ac organized the program. The purpose of this report provides information about the child s educational performance. Other pertinent future work includes the extension of our projective lloyd patterns against other patterns on a role not based on his scores on this scale is different for the sake of visual clarity, we specify all spaces via a set x. In a way, man, i undid it just so much that a single…

When combining the texts, it might make sense to “normalize” the percentages for each source text. How it works now with raw counts makes it so longer documents have more of their style preserved in the final output document.

You may also want to give weightings to different text so you can have a sliding scale between Poe and Asimov for instance, by basically scaling the counts from their files higher or lower to give more or less representation in the results.
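Here is one possible way to do both the normalizing and the weighting, as a sketch only. I haven’t tried this in the repo’s code, and `AddSource` and `WeightedCounts` are hypothetical names. Each source’s counts are divided by that source’s total number of transitions, so a long document doesn’t drown out a short one, and then scaled by a weight you choose.

```cpp
#include <cassert>
#include <map>
#include <string>

using NextWordCounts = std::map<std::string, std::map<std::string, int>>;
using WeightedCounts = std::map<std::string, std::map<std::string, double>>;

// Add one source text's counts into `combined`.
// Dividing by the source's total transition count keeps long documents from
// dominating, and `weight` then dials that source's influence up or down.
void AddSource(WeightedCounts& combined, const NextWordCounts& source, double weight)
{
    double total = 0.0;
    for (const auto& state : source)
        for (const auto& next : state.second)
            total += next.second;
    assert(total > 0.0 && "source text produced no transitions");

    for (const auto& state : source)
        for (const auto& next : state.second)
            combined[state.first][next.first] += weight * next.second / total;
}
```

The combined values are no longer whole counts, but they can still be used as weights when picking the next word, since only their relative sizes matter.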

When analyzing the text, I had to think about what to do with punctuation. I chose to treat punctuation as words in themselves, but ignored some punctuation that was giving weird results – like double quotes. I’ve only just now realized that I incorrectly ignore question marks. Oops.

When generating text, I made it so some words don’t put a space before themselves (like a period!), and I also made it so words have their first letter capitalized after a period or similar. It seems to need ad hoc, domain specific massaging to get reasonable results.
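That kind of massaging might look something like this sketch. The exact punctuation rules here are my assumptions for illustration (the repo’s rules may differ), and `JoinTokens` is a made up name.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Join generated tokens into text: no space before punctuation tokens, and
// capitalize the first word of the text and the word after ".", "!" or "?".
std::string JoinTokens(const std::vector<std::string>& tokens)
{
    const std::string punctuation = ".,;:!?";
    std::string text;
    bool capitalizeNext = true;

    for (const std::string& token : tokens)
    {
        if (token.empty())
            continue;
        bool isPunctuation = token.size() == 1 && punctuation.find(token[0]) != std::string::npos;

        // Words get a space before them, punctuation attaches to the previous word.
        if (!text.empty() && !isPunctuation)
            text += ' ';

        std::string word = token;
        if (capitalizeNext && !isPunctuation)
        {
            word[0] = char(std::toupper(static_cast<unsigned char>(word[0])));
            capitalizeNext = false;
        }
        text += word;

        if (token == "." || token == "!" || token == "?")
            capitalizeNext = true;
    }
    return text;
}
```

For instance, the tokens `i heard it . yes , he was dead .` come out as `I heard it. Yes, he was dead.`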

It’s possible (especially with higher order Markov chains) that you can get into a situation where your current state has nothing to transition to. You’d have to figure out what to do in this case. One idea would be to choose a next word at random. Another idea would be to fall back to a lower order Markov chain maybe?
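The “fall back to a lower order” idea could be sketched like this. It’s untested against the repo and makes an assumption about storage: `tables[k]` holds the order k counts, keyed by the last k words joined with spaces (so `tables[0]` has the single key `""`). `FindFollowers` is a hypothetical name.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using NextWordCounts = std::map<std::string, std::map<std::string, int>>;

// Try the highest order first and step down until some state has followers.
// Returns nullptr if even the lowest order table has nothing to offer.
const std::map<std::string, int>* FindFollowers(const std::vector<NextWordCounts>& tables,
                                                const std::vector<std::string>& history)
{
    for (std::size_t order = tables.size(); order-- > 0;)
    {
        if (order > history.size())
            continue; // not enough words generated yet for this order

        // Build the state key from the last `order` words of the history.
        std::string state;
        for (std::size_t i = history.size() - order; i < history.size(); ++i)
        {
            if (!state.empty())
                state += ' ';
            state += history[i];
        }

        auto it = tables[order].find(state);
        if (it != tables[order].end() && !it->second.empty())
            return &it->second;
    }
    return nullptr;
}
```

If an order 0 table is included, the final fallback is “pick any word by overall frequency”, which covers the random word idea as well.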

I feel like once you understand the algorithm, it’s an art form to teach and tune the Markov chain to get good results. I bet there are some interesting techniques beyond the simple things I’ve done here.

**Mathy Markov Chain Info**

If you want to dive into the mathy side of Markov chains, here are some great resources you can follow to get there…

A great linear algebra online “text book”, that is very easy to read and understand: http://immersivemath.com/ila/index.html

Some great videos on linear algebra: https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

A 9 part series on Markov chains. It’s this long because it’s very explicit and works through the details by hand. I watched it at like 1.5x speed and was fine.

Some “mathy” notes about Markov chains, including higher order ones:

http://personal.psu.edu/jol2/course/stat416/notes/chap4.pdf

**Q Learning**

Related to Markov chains, Q learning is essentially a way to learn from data how to act in a Markov decision process (a Markov chain with actions and rewards added) – for instance learning how to play tic tac toe, or how to traverse a maze.

I would like to learn Q learning better and make a post (and code!) at some point.

Q Learning Explained With HTML5

https://blockulator.github.io/Q-Learning-Explained-With-HTML5/

An introduction to Q-Learning: reinforcement learning

https://medium.freecodecamp.org/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

Reinforcement Learning Tutorial Part 1: Q-Learning

https://blog.valohai.com/reinforcement-learning-tutorial-part-1-q-learning

Reinforcement Learning Tutorial Part 2: Cloud Q-learning

https://blog.valohai.com/reinforcement-learning-tutorial-cloud-q-learning

Reinforcement Learning Tutorial Part 3: Basic Deep Q-Learning

https://towardsdatascience.com/reinforcement-learning-tutorial-part-3-basic-deep-q-learning-186164c3bf4

**Other**

Here is a twitter conversation about some compelling uses of Markov chains

Here’s a video “Markov Chain Monte Carlo and the Metropolis Algorithm” which uses Markov chains to help calculate integrals numerically.

Again, the code for this post is up on github at https://github.com/Atrix256/TextMarkovChain

The code is written for readability and runs plenty fast for this demo (nearly instant in release, a couple seconds in debug), but there are lots of string copies etc. that you would want to fix up if using this code seriously.

Thanks for reading!

This algorithm works well but it is blind to the actual values it got when making guesses, beyond just checking if they were greater or less than the search value.

I recently wondered: If we knew the min and max value stored in the list, couldn’t we make a more intelligent guess as to where the search value might be? We could fit the data with a line, figure out where our guess would be on that line, and make that be our initial guess. As we iterate, we could use our incorrect guesses as new min or max values of the line as appropriate, updating our line fit as we went, and perhaps arrive at an answer more quickly.

Another way of looking at this: If the guess a binary search made is VERY far from the search value, maybe it should go farther than the midpoint when making the next guess? Or, if it was pretty close to the search value, maybe it shouldn’t go as far as the midpoint? Close vs far measurements depend on the overall magnitude of the numbers in the list, so you’d need to know what sort of values are stored. A min and a max value of the list can give you a rough idea of that, especially if you update those min / max values as you repeatedly cut the list with guesses.

This post explores that idea. The result is something that could be more attractive than binary search, depending on what kind of trade offs are being looked for. While I haven’t heard of this technique, I wouldn’t be surprised if it’s been tried before and written about. (Know of a source? Let me know!)

UPDATE: @thouis from twitter mentioned the basic idea is called “interpolation search”. This post goes beyond that basic idea but you can read more about it here if you’d like https://www.techiedelight.com/interpolation-search/. He has a paper about interpolation search that you can read here (it has some relation to discrepancy, as in low discrepancy sequences, oddly!) https://erikdemaine.org/papers/InterpolationSearch_SODA2004/

The post goes a step further to address a problem that is encountered when using this algorithm, and also talks about other ways this algorithm might be extended or generalized.

An implementation, and the code that generated all the data for this post, can be found here: https://github.com/Atrix256/LinearFitSearch

(Feel free to skip this section if you get lost. You won’t miss anything important about the algorithm itself)

If you are wise in the ways of numbers, you might be saying to yourself that this only works if you have roughly evenly distributed numbers – basically, a flat PDF, or a flat histogram. This is because by only knowing the min and max, you are doing a linear fit of the data, and making guesses as if your data is well represented by that line. The less like a line your data actually is, the less well this ought to work.

That is true, and I thought up this idea while trying to think of how to generate 1d blue noise more quickly, which is random but roughly evenly spaced values. For that usage case it does well, but there are many types of non linear data out there that you might want to search through.

Really what you want to do is learn the distribution of the values in the list, and use that knowledge to know where the value you are searching for is likely to be.

I didn’t go that direction in these experiments, but it seems like a data scientist would have plenty of tools in their tool box to attempt something like that. Markov chain Monte Carlo type algorithms come to mind.

There’s another way to look at the problem of searching for a value in a list, and that’s to look at it as strictly a function inversion problem.

If you look at your sorted list as a lookup table, where the index is the x value, and the value stored is the y value, a search tries to tell you the x value for a specific y value that you are searching for.

In this context you only care about integer values of x, and there might be duplicate values in the list, so the function isn’t strictly monotonic (where each y value is larger than the last). It satisfies a more relaxed condition, where each y value is >= the last y value.

Thinking about the search problem as a function inversion problem, ignoring the monotonicity issue, there are far too many data points to do an analytic inverse, so you would be looking at numerical inverse solutions.

I also didn’t really explore that direction, so it’s another way to go that might yield some better fruit.

Lastly, you could see searching a sorted list as a root finding problem. If you are looking for where the function minus the search value equals zero, numerical root finding functions could maybe help you here. I also did not try anything in that direction.

If anyone ends up exploring any of the alternative avenues, I’d love to hear what kind of techniques you used and what your results were!

The algorithm works like this…

1. Start with a sorted list, and the minimum and maximum value stored in that list.
2. Calculate a line fitting the min and max. For an equation y=mx+b, you are calculating m and b.
3. Using the inverse of the function, which is x=(y-b)/m, make a guess for what index (x) the search value (y) is at by plugging the search value into that equation as y and getting an x. That x is the index you are guessing the value is at.
4. If your guess was correct, you are done so exit. Otherwise, if the guess was too high, this is your new max. If the guess was too low, this is your new min. If you’ve run out of list to search, the value isn’t there, so exit.
5. Go to step 2.
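To make the line fit and inversion steps concrete, here is a tiny sketch of a single guess. `LineFitGuess` is a name made up for this example. It assumes the search value lies between the two known values, and it rounds to the nearest index the same way the full code further down does.

```cpp
#include <cassert>
#include <cstddef>

// Guess the index of searchValue by inverting the line through
// (minIndex, minValue) and (maxIndex, maxValue): x = (y - b) / m.
std::size_t LineFitGuess(std::size_t minIndex, float minValue,
                         std::size_t maxIndex, float maxValue,
                         float searchValue)
{
    assert(maxIndex > minIndex && maxValue > minValue);
    assert(minValue <= searchValue && searchValue <= maxValue);

    float m = (maxValue - minValue) / float(maxIndex - minIndex); // rise / run
    float b = minValue - m * float(minIndex);                     // b = y - mx
    return std::size_t(0.5f + (searchValue - b) / m);             // round to nearest
}
```

For example, with index 0 holding 0 and index 100 holding 2000, the line has m=20 and b=0, so a search for 500 guesses index 25. If index 25 actually held 500 but we were searching for 1250, index 25 becomes the new min, and the next guess from (25, 500) and (100, 2000) is index 63.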

This algorithm assumes the sorted list looks like a line if you were to graph it, so it does better when the sorted list actually looks like a line.

Let’s see how it does for a linear list with values in it between 0 and 2000.

The left image shows the items in the array.

In the middle image, x axis is the number of items in the list, and y axis is how many guesses it took to search for a random value. This shows the average of 100 runs.

In the right image, it shows the minimum and maximum guesses it took for each list size, for those same 100 runs.

The linear fit did pretty well, didn’t it? At minimum it took zero guesses (the search value was less than or equal to min, or greater than or equal to max), and at maximum it took 2 guesses to find the search value, regardless of list size.

Binary search took about the usual log2(N), as expected.

Let’s try a list made up of random numbers between 0 and 2000.

That looks pretty similar to the linear case, but the line fit search doesn’t beat binary search by quite as much. The randomness of the list makes it so the guesses are more often wrong, and so it takes a few extra guesses to find the right place.

Let’s try a quadratic function: y=2000x^2:

The average for line fit search still beats binary search, but if you look at the min/max graph, the line fit min and max entirely encompasses the binary search min and max. That means there is a ton of variance about whether it will be faster or slower than binary search, even though on average it will be faster.

Let’s try a cubic function: y=2000x^3:

While the average still (barely) beats binary search, the maximum for line fit search has gotten REALLY erratic.

Let’s try a log function:

Ouch, the line fit is actually doing worse now than the binary search.

Lastly, let’s go back to the linear list, but let’s make the last entry in the table be 200,000 instead of 2000:

Ouch! Linear fit search is super awful now. What happened?!

It turns out that this uneven histogram type of list is really a worst case scenario for the line fit search.

What is happening here is that it sees the min as 0 and the max as 200,000 so it thinks the line is very steep. On its first guess, whatever it is searching for (it searches for a random value between 0 and 2000), it will think the value is at index 0. It will very likely be wrong, and eliminate index 0. The next round, it will choose index 1, be very likely wrong again, and repeat by picking 2 then 3 then 4 and so on. This data layout nearly forces this search into a more computationally expensive version of linear search. Binary search doesn’t have this problem because it doesn’t care what the values are, it just cuts the list in half repeatedly until it’s done.
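You can see that creeping behavior with a small guess counter. This is a compact sketch written for this explanation, not the test code from the repo: `CountLineFitGuesses` is a hypothetical name, it uses `std::min`/`std::max` in place of a clamp helper, and it assumes the search value is strictly between the first and last values in the list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Count the guesses a line fit search makes looking for searchValue.
// Assumes values is sorted and values.front() < searchValue < values.back().
std::size_t CountLineFitGuesses(const std::vector<std::size_t>& values, std::size_t searchValue)
{
    assert(values.size() >= 3);
    std::size_t minIndex = 0;
    std::size_t maxIndex = values.size() - 1;
    assert(values[minIndex] < searchValue && searchValue < values[maxIndex]);

    std::size_t guesses = 0;
    while (minIndex + 1 < maxIndex)
    {
        // fit a line through the current end points and invert it
        float min = float(values[minIndex]);
        float max = float(values[maxIndex]);
        float m = (max - min) / float(maxIndex - minIndex);
        float b = min - m * float(minIndex);
        std::size_t guessIndex = std::size_t(0.5f + (float(searchValue) - b) / m);

        // never re-test the end points we already know about
        guessIndex = std::min(std::max(guessIndex, minIndex + 1), maxIndex - 1);

        ++guesses;
        if (values[guessIndex] == searchValue)
            return guesses;
        if (values[guessIndex] < searchValue)
            minIndex = guessIndex;
        else
            maxIndex = guessIndex;
    }
    return guesses;
}
```

As a usage example, take a 100 item list holding 0, 20, 40, ... and replace the last entry with 200,000. Searching for 1000 (which sits at index 50) takes 50 guesses with this sketch, one per index, while the same search on the plain 0, 20, ..., 1980 list takes 1 guess.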

Wouldn’t it be nice if we could know whether it’d be better to use binary search or linear fit search for a data set?

We’d have to analyze the data set to figure that out, and if we are going to go to all that trouble, we probably should just learn the shape of the data set in general and use that knowledge to make a better guess than either binary search or linear fit.

I think going that route could be fruitful, but I didn’t try it. Instead I came up with a Hybrid Search.

Here is my more readable, less optimized code for the linear fit search.

```cpp
TestResults TestList_LineFit(const std::vector<size_t>& values, size_t searchValue)
{
    // The idea of this test is that we keep a fit of a line y=mx+b
    // of the left and right side known data points, and use that
    // info to make a guess as to where the value will be.
    //
    // When a guess is wrong, it becomes the new left or right of the line
    // depending on if it was too low (left) or too high (right).
    //
    // This function returns how many steps it took to find the value
    // but doesn't include the min and max reads at the beginning because
    // those could reasonably be done in advance.

    // get the starting min and max value.
    size_t minIndex = 0;
    size_t maxIndex = values.size() - 1;
    size_t min = values[minIndex];
    size_t max = values[maxIndex];

    TestResults ret;
    ret.found = true;
    ret.guesses = 0;

    // if we've already found the value, we are done
    if (searchValue < min)
    {
        ret.index = minIndex;
        ret.found = false;
        return ret;
    }
    if (searchValue > max)
    {
        ret.index = maxIndex;
        ret.found = false;
        return ret;
    }
    if (searchValue == min)
    {
        ret.index = minIndex;
        return ret;
    }
    if (searchValue == max)
    {
        ret.index = maxIndex;
        return ret;
    }

    // fit a line to the end points
    // y = mx + b
    // m = rise / run
    // b = y - mx
    float m = (float(max) - float(min)) / float(maxIndex - minIndex);
    float b = float(min) - m * float(minIndex);

    while (1)
    {
        // make a guess based on our line fit
        ret.guesses++;
        size_t guessIndex = size_t(0.5f + (float(searchValue) - b) / m);
        guessIndex = Clamp(minIndex + 1, maxIndex - 1, guessIndex);
        size_t guess = values[guessIndex];

        // if we found it, return success
        if (guess == searchValue)
        {
            ret.index = guessIndex;
            return ret;
        }

        // if we were too low, this is our new minimum
        if (guess < searchValue)
        {
            minIndex = guessIndex;
            min = guess;
        }
        // else we were too high, this is our new maximum
        else
        {
            maxIndex = guessIndex;
            max = guess;
        }

        // if we run out of places to look, we didn't find it
        if (minIndex + 1 >= maxIndex)
        {
            ret.index = minIndex;
            ret.found = false;
            return ret;
        }

        // fit a new line
        m = (float(max) - float(min)) / float(maxIndex - minIndex);
        b = float(min) - m * float(minIndex);
    }

    return ret;
}
```

Since binary search and linear fit search both have situationally good properties, I decided to try a hybrid of the two where it switches between the two for each guess. The first guess is a linear fit, the next is a binary search guess, then back to linear fit, and so on.

Here’s where that puts things with the previous worst case scenario: the linear data with a single huge outlier. New graph on top, old on bottom for comparison. Apologies that the colors aren’t consistent between old and new!

There’s quite a bit of variance, and the linear fit min and max contains the binary search min and max, but on average it does beat the binary search now, which is kind of neat.

Let’s analyze the line fit worst performers to best performers and see how the hybrid search compares.

Here’s the log function:

The variance has decreased compared to line fit. The average beats binary search too, where the non hybrid test didn’t.

Next is the cubic function:

With the non hybrid approach, cubic on average was barely beating binary search and had a huge amount of variance. The hybrid average is beating binary search by a larger margin and the variance has dropped a lot.

Here’s quadratic:

The line fit search beat binary search, like the hybrid search does. It even beats it by roughly the same amount. The hybrid search has a lot less variance though, which is a nice property. You’ll have more consistent timings as you search.

Here’s random:

The hybrid search does a little worse both for average, and variance, than the linear fit search did.

Last is linear:

It’s impossible to see where the hybrid max line is, but it went up to 3, from the 2 that line fit max was at, which also brings the average up just a little bit. In my opinion, it isn’t so bad that we slightly damaged the perfectly linear and random cases in favor of making the search much more robust in the general case.

Here is my more readable, less optimized code for the hybrid search. The only meaningful differences are where `guessIndex` is calculated, which chooses between a linear fit step and a binary search step based on `doBinaryStep`, and the end of the loop, where `doBinaryStep` is toggled to switch which one it does next.

```cpp
TestResults TestList_HybridSearch(const std::vector<size_t>& values, size_t searchValue)
{
    // On even iterations, this does a line fit step.
    // On odd iterations, this does a binary search step.
    // Line fit can do better than binary search, but it can also get trapped in situations that it does poorly.
    // The binary search step is there to help it break out of those situations.

    // get the starting min and max value.
    size_t minIndex = 0;
    size_t maxIndex = values.size() - 1;
    size_t min = values[minIndex];
    size_t max = values[maxIndex];

    TestResults ret;
    ret.found = true;
    ret.guesses = 0;

    // if we've already found the value, we are done
    if (searchValue < min)
    {
        ret.index = minIndex;
        ret.found = false;
        return ret;
    }
    if (searchValue > max)
    {
        ret.index = maxIndex;
        ret.found = false;
        return ret;
    }
    if (searchValue == min)
    {
        ret.index = minIndex;
        return ret;
    }
    if (searchValue == max)
    {
        ret.index = maxIndex;
        return ret;
    }

    // fit a line to the end points
    // y = mx + b
    // m = rise / run
    // b = y - mx
    float m = (float(max) - float(min)) / float(maxIndex - minIndex);
    float b = float(min) - m * float(minIndex);

    bool doBinaryStep = false;
    while (1)
    {
        // make a guess based on our line fit, or by binary search, depending on the value of doBinaryStep
        ret.guesses++;
        size_t guessIndex = doBinaryStep ? (minIndex + maxIndex) / 2 : size_t(0.5f + (float(searchValue) - b) / m);
        guessIndex = Clamp(minIndex + 1, maxIndex - 1, guessIndex);
        size_t guess = values[guessIndex];

        // if we found it, return success
        if (guess == searchValue)
        {
            ret.index = guessIndex;
            return ret;
        }

        // if we were too low, this is our new minimum
        if (guess < searchValue)
        {
            minIndex = guessIndex;
            min = guess;
        }
        // else we were too high, this is our new maximum
        else
        {
            maxIndex = guessIndex;
            max = guess;
        }

        // if we run out of places to look, we didn't find it
        if (minIndex + 1 >= maxIndex)
        {
            ret.index = minIndex;
            ret.found = false;
            return ret;
        }

        // fit a new line
        m = (float(max) - float(min)) / float(maxIndex - minIndex);
        b = float(min) - m * float(minIndex);

        // toggle what search mode we are using
        doBinaryStep = !doBinaryStep;
    }

    return ret;
}
```

Just like binary search, the linear fit and hybrid search algorithms can return you the index to insert your value into the list, if not present.

Some folks may balk at the idea of having the min and max value of the list before you do a search, from the point of view that it’s sort of like 2 guesses that aren’t being counted against the graph. If that’s your point of view, you can add 2 to the values graphed and you can see that the hybrid search is still compelling. I think it’s perfectly reasonable that you’d know the min and max of a sorted list though. After all, we store the length, why not also the min and max?

It may not be optimal to do 1 step of line fit search and 1 step of binary search in the hybrid search method. It might be that doing something like 1 binary step then 3 line fit steps, and repeating that pattern, gives you better results. It may also be a better idea to just do line fit search, but if you aren’t making good enough progress, throw in a binary search step. I didn’t explore this at all due to the “nice enough” results I got switching off every time.

I had a thought that it might be good to try doing an “online least squares fit” while making guesses so that you learned the shape of the list while searching it. If that sounds interesting to you, give this a read: https://blog.demofox.org/2016/12/22/incremental-least-squares-curve-fitting/. I suspect that having a more localized fit (like in this post) performs better, but I might be wrong. I could also see doing a least squares fit of the data offline in advance so you had that data available, like a min and a max, before you started the search. A problem with doing a fit in general though is that you have to be able to invert the function of whatever you fit the data with. Quadratic or cubic seem like they are probably the limit of what you’d want to try, to avoid ringing and the complexity of higher order function inversion.

You can make binary searches more cache friendly by putting the values into a binary tree stored in an array. This makes it so, for instance, that when you test index 0, you are really testing the half way point. If the search value is less than the value at index 0, you look at index 1, else you look at index 2. The left and right children of an index are just index*2+1 and index*2+2. I bring this up because the “fixed guess points” of a binary search make this possible. A linear fit search doesn’t have fixed guess points, which makes it not possible to do the same thing. I’m betting with some creativity, some better cache friendliness could be figured out for a linear fit search.
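Here is a minimal sketch of that array layout, assuming a 0-based array where the children of a node live at `node*2+1` and `node*2+2`. The function names `BuildImplicitTree` and `ImplicitTreeContains` are made up for this example and are not from the code that goes with this post.

```cpp
#include <cstddef>
#include <vector>

// Fills `tree` (same size as `sorted`) in breadth-first order by doing an
// in-order walk of the implicit tree, so tree[0] ends up being roughly the
// half way point of the sorted list. Returns the next unread index in `sorted`.
size_t BuildImplicitTree(const std::vector<int>& sorted, std::vector<int>& tree,
                         size_t sortedPos = 0, size_t node = 0)
{
    if (node >= tree.size())
        return sortedPos;

    // everything in the left subtree is smaller than this node
    sortedPos = BuildImplicitTree(sorted, tree, sortedPos, node * 2 + 1);
    tree[node] = sorted[sortedPos++];
    // everything in the right subtree is larger than this node
    return BuildImplicitTree(sorted, tree, sortedPos, node * 2 + 2);
}

// Binary search over the implicit tree: start at index 0 and walk down.
bool ImplicitTreeContains(const std::vector<int>& tree, int value)
{
    size_t node = 0;
    while (node < tree.size())
    {
        if (tree[node] == value)
            return true;
        node = (value < tree[node]) ? node * 2 + 1 : node * 2 + 2;
    }
    return false;
}
```

For the sorted list 1 through 7, this layout stores the values as 4, 2, 6, 1, 3, 5, 7, so the first few guesses of every search read from the front of the array.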

Following in that idea, is the concept of a cache oblivious b-tree. Check it out here: https://github.com/lodborg/cache-oblivious-btree

Another nice property of binary searching is that you can make it branchless and very SIMD friendly, or very friendly for simple hardware implementations. A linear fit search doesn’t seem as well suited for that, but again, maybe some creativity could help it be so. Here’s more about binary search operating like I just described: https://blog.demofox.org/2017/06/20/simd-gpu-friendly-branchless-binary-search/

Lastly, you might have noticed that the graph for the linear data set showed that the line fit and hybrid searches were taking fewer guesses as the list got larger. It looks impossible, and lets me make this dank meme:

What the heck is going on there?

The x axis of those graphs shows how large the list is, and the y axis is how many guesses are taken, but in all those linear lists of each size, the list linearly breaks up the range [0,2000]. It’s also always searching for random numbers in [0,2000].

In smaller lists, the numbers are more sparse, while in larger lists the numbers are more dense.

If you have a linear data set, and are using a linear fit to look for a number in that list that may or may not be there, a denser list will have the values there more often, and the first guess is going to more often be the correct location of the search value.

That’s what is happening, and that’s why it’s showing an improvement in the linear case as the list gets larger, because it’s also getting more dense.

Here’s a graph for a version of the test where the density is kept the same for each list. The lists are between [0,5*count] and the search values are in the same range.

It’s interesting and kind of cool that both the average and min/max are flat, but this is a best case scenario for the line fit (and hybrid) search, with the data actually being linear.

Ok finally we get to performance. Many of you fine folks were probably looking at the guess count graphs and thinking “So what? Where’s the perf measurements?” TL;DR I think this is a pareto frontier advancement but I’ll explain more.

Here are the perf results, but don’t be too quick to say “aha!”, because they need some explanation and context. These results are on my modern-ish gaming laptop.

Results:

- Linear search takes ~1.5 nanoseconds per guess. (eg, increment the index and read the next value from the array)
- Binary search takes ~5 nanoseconds per guess.
- Both linear fit and hybrid search take ~12 nanoseconds per guess.

So, from my tests, binary search would need to take 2.5 times as many guesses as linear fit or hybrid searching to break even. The only case where that is true in my tests is the purely linear list.

Now that I’ve said that, I don’t think the tests I’ve done are really a good apples to apples comparison.

What I did as a test was generate lists of the various types described above, generate a list of random numbers to search for in them, then have each search algorithm do all the searches. I divided the total time by the total number of guesses done to get a time per guess for each algorithm.

It is true that the linear fit is slightly more complicated logic than a binary search, or the linear search, so computationally I do expect it to take longer, and the 2.5x as long seems like a fair measurement.

HOWEVER, searching the same list over and over is an unrealistic pattern for most applications. More of the list would be likely to be in the cache when doing multiple searches back to back like this, so memory reading would be under-reported in the profiling.

Because the linear fit (and hybrid) searches are more computationally expensive, but end up doing fewer guesses, they use more CPU, but less memory bandwidth. That means that the wins they give would show up in times when reads from memory (or wherever the list is stored) are slower. Having the list in the cache is not a time when the reads are going to be slower, so I think the testing is stacked against the linear fit and hybrid searches.

That said, I can’t think of a better “canned performance test” to compare apples to apples. You really would need to drop it in, in a realistic usage case for searching in an application, and see if it was better or worse for that specific usage case.

If you were memory bandwidth bound, and thus had some compute to spare, this search seems like it could possibly be a nice option. Or, in exotic situations where reading a list was VERY VERY slow (remote servers, homomorphic encryption, data stored on disk not in memory?) this could be a better algorithm. In those exotic situations where reads are way more expensive than computation, you’d probably want to go further though, and use more advanced algorithms to really make every guess count, using a lot more CPU to do so.

Lastly on perf: none of this code has been optimized. I wrote it for clarity, not speed. It’s possible that the comparison landscape could change (either for better or worse) with optimized code.

If anyone investigates perf more deeply, I’d love to hear results and in what context those results were found. Thanks!

An obvious question is: can this search technique extend to quadratic and beyond?

I do think so. Let’s look at how that might work, and then I’ll point out some complications that make it more challenging.

Let’s think about the quadratic case. You’d need to start with a quadratic fit of the data, which would require 3 data samples from the list. Two data samples would be the first and last index just like the linear search, but where should the third data point be from?

One place it could be is in the middle of the list. If you can afford more processing time than that, you might consider picking whatever index gives the lowest error between the quadratic fit and the actual data stored in the array.

Now you have a quadratic fit of the data in the array and can begin searching. You have some y=f(x) function that is quadratic, and you invert it to get a x=f(y) function. All is well so far.

You make your first guess by plugging your search value in for y and getting an x out, which is your first guess for where the number is. When you read that number, if it is the search value, you are done. If it doesn’t match though, what do you do?

Your guess point is going to be between your min and max, but it might be to the left or the right of the third point you have in the quadratic fit. That is two possibilities.

Your guess may also be too low, or too high. That is two more possibilities, making for four possible outcomes to your guess.

Let’s say your guess was to the left of the “third point” and deal with these two outcomes first:

- If your guess was less than the search value, it means that your guess is the new minimum.
- If your guess was greater than the search value, it means that your guess is the new maximum. A problem though is that your “third point” is now to the right of the search maximum. This isn’t so bad because it still fits real data on the curve but it seems a little weird.

If your guess was on the right of the “third point”, we have these two outcomes to deal with:

- If your guess was less than the search value, the guess is the new minimum, and the “third point” in the quadratic fit is to the left and is less than the minimum.
- If your guess was greater than the search value, the guess is the new maximum.

Are you with me so far? The “third point” seems oddly stationary at this point, but the next round of searching fixes that.

On the second step of searching (and beyond), we have some new possibilities to add to the previous four. The “third point” can either be less than the minimum or greater than the maximum. That is two possibilities.

And once again, we have two possibilities in regards to what our guess found: The guess value could be lower than the search value, or it could be higher.

Due to symmetry, let’s just consider the “third point” to be greater than our max, and then we can just consider the less than and greater than case:

- If our guess was too small, it’s the new minimum.
- If our guess was too large, it’s the new maximum, but the old maximum becomes the new “third point”. This moves the “third point” to be more local, giving us a more local quadratic fit of our data, which should help the search make better guesses.

So now, the “third point” moves around, and the quadratic fit is updated to be a localized fit, like we want it to be.

For the cubic case and above, I’ll leave that to you to sort out. It just is updating the minimum and maximums based on the guess value vs search value, and then doing a dance to make sure and keep the most local points around for the curve fit of the data, and throwing out the less local points to make room. I am pretty sure it’s extendable to any degree you want, and that one algorithm could be written to satisfy arbitrary degrees.

Now onto a complication!

Our very first step is to make an initial fit of data of whatever degree and then invert it. To invert the function, it needs to be monotonically increasing – aka there is no part on the graph where if you look at the point to the left, it’s higher. Each point on the graph should be higher than the point to the left.

The bad news is that even in the quadratic case, if you make a quadratic curve pass through 3 data points A, B, C where A <= B <= C, the result is very often NOT going to be monotonic.

That means you are going to have a bad time trying to invert that function to make a guess for where a search value should be in the list.

I think a good plan of attack would be to fit it with a monotonic quadratic function that didn't necessarily pass through the 3 data points. That would affect the quality of your guess, but it might (probably should??) do better at guessing than a line fit, at the cost of being more computationally expensive. I'm not sure how to do that specifically, but I'd be surprised if there wasn't an algorithm for it.

For details on how even quadratic often isn't monotonic:

https://twitter.com/Atrix256/status/1108031089493184512

Some possibly good leads to dealing with this:

https://en.wikipedia.org/wiki/Monotone_cubic_interpolation

Thanks for reading. Hopefully you found it enjoyable.

If you use this, or do any related experimentation, I’d love to hear about it.

You can find me on twitter at https://twitter.com/Atrix256

**Chebyshev curve fitting / interpolation** – with simple working c++ code. Possibly rational chebyshev too. Chebyshev apparently is optimal for polynomial interpolation.

https://www.embeddedrelated.com/showarticle/152.php

**Optimistic concurrency (in databases)**. Select data and version id per row. Update versionid = versionid+1, blah where blah and version id = version id. If rows affected is 0, that means something else beat you to the punch and you can deal with it however.

**Cordic math**. Every iteration in a loop gives you another bit of precision, since it’s basically a binary search.

**2D SDFs for vector graphics** – using modulus for “free repeating”. anti aliasing. use your shadertoy verlet physics game as an example?

**Verlet Physics** Keep last and current position to get implicit velocity. Simulate. Iterative constraint solving. Things “just work” pretty well.
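A rough 1D sketch of what that note describes, assuming a fixed time step. The `Particle` struct and function names are made up for illustration. The velocity is never stored. It is implied by the difference between the current and last position.

```cpp
struct Particle
{
    float pos;     // current position
    float lastPos; // position on the previous step
};

// One Verlet integration step. (pos - lastPos) acts as the implicit velocity.
void VerletStep(Particle& p, float acceleration, float dt)
{
    float implicitVelocity = p.pos - p.lastPos;
    float newPos = p.pos + implicitVelocity + acceleration * dt * dt;
    p.lastPos = p.pos;
    p.pos = newPos;
}

// One pass of a distance constraint: each particle moves half of the error.
// Iterative constraint solving means calling this a few times per step
// for every constraint.
void SolveDistanceConstraint(Particle& a, Particle& b, float restLength)
{
    float delta = b.pos - a.pos;
    float dist = (delta < 0.0f) ? -delta : delta;
    if (dist == 0.0f)
        return; // no direction to push in
    float correction = 0.5f * (dist - restLength) * (delta / dist);
    a.pos += correction;
    b.pos -= correction;
}
```

Because the constraint moves positions directly, the implicit velocity picks up the correction on the next step without any extra bookkeeping.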

**Minkowski Portal Refinement** A nice & simple algorithm for collision detection. Maybe talk about algorithm to get depth. Mention JGK, possibly do that after.

**Deterministic Simulations** using deterministic sim to eg decrease network traffic.

**Quick Math: phi / golden ratio** show how golden ratio conjugate is the same number past the decimal point. show how this is the only number that can do that. The main point being “i remember this fact, but don’t remember the number”. This fact lets you calculate the number.

**Quick math: Euler’s number** show how e^x being its own derivative (and integral) can only work for e. The main point being “i remember this fact, but don’t remember the number”. This fact lets you calculate the number.
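A quick sketch of that calculation, writing the golden ratio as $\varphi$ (my choice of symbol) and assuming its whole number part is 1, so that the conjugate $1/\varphi$ having the same digits past the decimal point means $1/\varphi = \varphi - 1$:

$$\frac{1}{\varphi} = \varphi - 1 \;\Longrightarrow\; \varphi^2 - \varphi - 1 = 0 \;\Longrightarrow\; \varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618$$

Only the positive root of the quadratic is kept here.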

**Ear clipping** – for turning polygons into triangles. extendable to 3d with tetrahedron clipping, and to higher dimensions as well.

**Storageless Shuffle With Weights** – this is like if you have 3 red marbles and 5 blue marbles, how would you use FPE to storagelessly shuffle them.

**Recurrent neural networks (etc) for “time series” learning** – https://twitter.com/Peter_shirley/status/1066832031043149824?s=03

**Markov Chain Monte Carlo – Eg. for decryption** maybe try 2nd order or higher chains.

Maybe also try with rendering / numerical integration http://statweb.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf

**Blue Noise AO** – It’s common to use white noise for sampling and also sample rotation. Start from there and show how to use blue for sampling and also rotation!

https://learnopengl.com/Advanced-Lighting/SSAO

http://john-chapman-graphics.blogspot.com/2013/01/ssao-tutorial.html

**Other blue noise usage cases** – specific usage cases with easy to follow implementations

* fog shafts

* shadows (pcf)

* reflections

* dithering

**Data Cache** When doing coding experiments, there are often pieces of data that take time to calculate that are based on parameters that don’t often change from run to run. Making a data cache can help. Semi compelling usage cases: 1) next largest prime greater than N. 2) integrate a bilinear function. Compare / contrast to content addressable storage. CAS is the hash of the contents, this is the hash of the params that make the contents. code: https://github.com/Atrix256/ProgressiveProjectiveBlueNoise/blob/master/cache.h

**Magic Eye Implementation** turn a generic depth buffer into magic eye?

https://steemit.com/steemit/@mynameisbrian/how-to-create-a-steemit-themed-magic-eye-image-using-photoshop

Paul Malin’s shadertoy: https://twitter.com/P_Malin/status/1084251862893817865?s=03

**Exposure and fstops**

exposure is a multiplication. fstops are 2^(stop). feels linear when using fstops. must do multiplication in linear space, else is wrong (show it’s wrong).

**Reaction Diffusion aka Turing patterns**

https://en.wikipedia.org/wiki/Turing_pattern

https://www.quantamagazine.org/ancient-turing-pattern-builds-feathers-hair-and-now-shark-skin-20190102/

**Kalman Filter**

https://home.wlu.edu/~levys/kalman_tutorial/

Biquad – a better frequency filter

Compressor & Limiter – automatic volume adjustment to eg avoid clipping. Include “side chain” stuff.

**Statistical Search**

Binary search works the way it does because it doesn’t know anything about the sorted list.

If you knew the min and the max in that sorted list, you could take a better guess at where to look first by finding the % that the value you are searching for is between min and max, and start at that % in the list.

The problem with that though is that it assumes an even distribution of numbers. If you have a bunch of small numbers and one huge number, this guess won’t be any good.
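A minimal sketch of that first guess, assuming a non-empty list of ints sorted in ascending order. `InitialGuessIndex` is a made up name for this example.

```cpp
#include <cstddef>
#include <vector>

// Guesses where searchValue should be, by finding what percent of the way it
// is between the min and max of the list, and going that percent into the list.
// Assumes `sorted` is non-empty and sorted ascending.
size_t InitialGuessIndex(const std::vector<int>& sorted, int searchValue)
{
    double minValue = double(sorted.front());
    double maxValue = double(sorted.back());

    if (searchValue <= sorted.front() || minValue == maxValue)
        return 0;
    if (searchValue >= sorted.back())
        return sorted.size() - 1;

    double percent = (double(searchValue) - minValue) / (maxValue - minValue);
    return size_t(percent * double(sorted.size() - 1) + 0.5);
}
```

For the evenly spaced list 0, 10, 20, 30, 40, searching for 20 guesses index 2, which is exactly right. For the list 1, 2, 3, 4, 1000, searching for 4 guesses index 0 even though the value is at index 3, which is the "one huge number" problem described above.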

So, idea… if you fit the sorted list with some low order monotonic polynomial, you could reverse that to get an initial guess.

You could also update your best guess as to where the number was each time you looked in a place and got something that wasn’t your search value. This maybe using a kalman filter?

**Faster 1d blue noise generation**

1) brute force

2) binary search

3) linear search from initial guess from fit

4) kalman filter?

Use this stuff in generic data lists, not just in blue noise?

Maybe also a fixed # of buckets to cut search down. basically generate multiple in parallel and then append them together (but neighbors across edges do matter!)

james would…

I’d probably just put uniform buckets

instead of a sorted array of numbers I’d try keeping the numbers ordered like a heap, then binary search is much faster, since it no longer suffers from non locality of memory access

an implicit binary tree where the children of a[i] are at a[i*2+1] and a[i*2+2]

Having hardware and API support for raytracing means we can use that to help with the usual raytraced graphics things – reflection, refraction, shadows and more – but it is also new hardware / API to abuse.

For instance, Inigo Quilez talks about how ray / triangle intersection could be used to do 3×3 linear equation solving: http://www.iquilezles.org/blog/?p=4666

In a similar style of hardware/API abuse (but not related to raytracing), I have shown how to make the linear texture interpolator calculate points on curves and surfaces when storing the control points in texels, and also showed how it can evaluate generic polynomials: https://blog.demofox.org/2016/02/22/gpu-texture-sampler-bezier-curve-evaluation/

In the not too distant future, I think it will be common to use raytracing for data lookups in real time graphics, much like we use textures currently. In fact, from a couple conversations I’ve had with folks on twitter, it seems as though some people are already doing this.

As strange as it sounds, raytracing has a significant advantage over texture lookups. A texture lookup is limited to data stored in a regular grid in pixels. Raytracing data lookups get their data from a mesh, with data stored in vertices (in a vertex buffer).

Storing data in the vertices of a mesh means that you can store data points wherever you want. If you need more detail in one area and less in another, you can put more vertices in the higher detail area, and fewer vertices in the low detail area. You could also store data in a blue noise sampling pattern to help fight aliasing problems you might have with the regular grid of a texture. Furthermore, you could actually have holes in your data for invalid data regions, by having holes in the mesh.

Essentially, a mesh is just a generalization of a texture. You are no longer locked to a grid!

How the data lookups are actually done is not too complex either.

For the 2d case where you have a function f(x,y), you would make a triangle mesh where the (x,y) position of each vertex was the location of a data point, and you would make the z value some constant such as 0.5.

To look up the value of the data for some (x,y) input, you could make a ray that started at (x,y,0) and went in direction (0,0,1). When you did your raytrace, you’d get as a result the triangle index and the barycentric coordinates of that triangle. From there you could look up the data from the 3 vertices of the triangle and use the barycentric coordinates to interpolate between the values. The interpolated value would be your result.

You can see how this process goes in the image below. The purple dot is the query location.

Just as there are volume textures to store 3d data in 3d textures, raytraced data lookups can also be extended to 3d.

For the 3d case where you have a function f(x,y,z), you would make a tetrahedral mesh where the (x,y,z) position of each vertex was the location of a data point.

To look up the value of the data for some (x,y,z) input, you need to be able to find what tetrahedron the point is in. To do this, you just shoot a ray starting at that (x,y,z) position in any arbitrary direction. That ray will hit one triangle in the tetrahedron. You then shoot a ray from the (x,y,z) position in the opposite direction to get a different triangle in the tetrahedron.

From these two triangles, you’ll have 6 vertices but only 4 will be unique. Those 4 vertices are the vertices of the tetrahedron. You can read the data from the vertices, calculate the barycentric coordinates of the point inside the tetrahedron, and then use those to interpolate the vertex data to get the result.

You can see how this process goes in the image below. The purple dot is the query location again.

The most immediate usage case I can think of for this technique would be for diffuse light probe grids. Whether you had a 2d or 3d light probe grid, you’d be able to make probes as dense as needed, or as sparse as you can get away with, in different sections of the geometry. You could also make holes in the mesh to make sure data didn’t interpolate through walls, leading to light leaking. You would use the techniques described above to interpolate the simplex data and get the result.

As this data is likely going to be relatively simple geometry compared to something like an actual game asset, it seems like it ought to be able to be pretty performant too.

Nathan Reed shared a really good idea on twitter, for doing the 3d lookup with only a single raytrace. The idea is that when you knew what triangle you hit, you could look in a table to get the fourth vertex based on whether you hit the triangle from the front or the back. One way to do this would be to have a buffer that had two vertex indices per triangle. The first index would be if you hit from positive, the second would be if you hit from negative.

That way, the index you’d look at would be [triangleIndex*2 + (hitBackSide ? 1 : 0)]. The parentheses matter in C-style languages, since + binds tighter than ?: does. That data lookup ought to be a lot cheaper than a second raytrace!
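As a sketch of that table lookup, assuming a flat buffer with two vertex indices per triangle (front side hit first, back side hit second). The buffer layout and the function name are assumptions for this example, not something from a specific API.

```cpp
#include <cstdint>
#include <vector>

// fourthVertexTable holds two entries per triangle:
//   [triangleIndex * 2 + 0] = fourth vertex if the triangle was hit from the front
//   [triangleIndex * 2 + 1] = fourth vertex if the triangle was hit from the back
uint32_t FourthVertexIndex(const std::vector<uint32_t>& fourthVertexTable,
                           uint32_t triangleIndex, bool hitBackSide)
{
    return fourthVertexTable[triangleIndex * 2 + (hitBackSide ? 1 : 0)];
}
```

The three vertices of the hit triangle plus this fourth vertex give the tetrahedron that contains the query point.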


Can using raytracing to do data lookups extend to 4d and beyond? Probably, but I’m not sure how. Do you know how, or have any interesting usage cases? Share if so, it’d be interesting to hear.

PS – Apparently some folks are using raytracing for GPU physics. I haven’t heard any details of how other people are doing it, but I am looking forward to getting a chance to try it myself. I’m thinking Verlet physics of particles with constraints. That amounts to only needing the current and previous particle positions to get an implicit velocity, and then doing small incremental constraint solving steps to try and make things keep their shape etc. The end result is something like screen space particles / screen space physics, except it would have knowledge of the entire scene, whereas screenspace techniques only have knowledge of the gbuffer. I’ve heard that short ray trace queries run a lot faster (20x?) by not needing to traverse the acceleration structure (BVH) as widely. With luck I’ll give it a try and write a post up about it before too long.

- **Void and Cluster** – This creates a greyscale blue noise texture that can be thresholded to create blue noise sample points of the desired density. More info: http://momentsingraphics.de/?p=127
- **Filtering White Noise** – This uses repeated iterations of filtering and histogram fixup to make blue noise (and other colors as well). More info: https://blog.demofox.org/2017/10/25/transmuting-white-noise-to-blue-red-green-purple/
- **Mitchell’s Best Candidate Algorithm** – This uses a variation on dart throwing to pick some points at random, and choose whichever point preserves the blueness of the samples best. More info: https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/

To be honest, the void and cluster algorithm is a “top tier” algorithm while filtering white noise is just kind of a hack, and Mitchell’s best candidate algorithm is decent and simple, but a bit outdated too.

Let’s look at some 128×128 blue noise textures created via void and cluster (top) and white noise filtering (bottom). The images on the right are the frequencies of the images (DFT magnitude). Since blue noise is high frequency noise, having a darker middle means a higher quality result.

Note: the white noise filtering used 25 iterations and a sigma of 1.5 for the blurring.

They look pretty similar don’t they? It turns out they are actually pretty different which I found really surprising when I was told. I had to see this for myself.

Below we threshold both images to 10%. What I mean by that is that if we consider black to be 0 and white to be 1, we make an image where there is a black dot if the color is < 0.1, else we make a white dot.
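The thresholding described above can be sketched like this, assuming a single channel U8 image stored as a flat array. `ThresholdImage` is a made up name for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Treats 0 as black and 255 as white (0.0 to 1.0). Any pixel darker than the
// threshold becomes a black dot, everything else becomes white.
std::vector<uint8_t> ThresholdImage(const std::vector<uint8_t>& pixels, float threshold)
{
    std::vector<uint8_t> out(pixels.size());
    for (size_t i = 0; i < pixels.size(); ++i)
    {
        float value = float(pixels[i]) / 255.0f;
        out[i] = (value < threshold) ? 0 : 255;
    }
    return out;
}
```

With a threshold of 0.1, a pixel value of 25 (about 0.098) becomes a black dot while 26 (about 0.102) becomes white. For a good blue noise texture, about 10% of the pixels should end up black, and those black dots should themselves be blue noise distributed.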

Void and cluster is top, and white noise filtering is middle. On the bottom is blue noise sample points generated with a discrete version of Mitchell's best candidate algorithm.

As you can see, the filtered white noise has already fallen apart for our purposes. It's basically unusable for this usage case. Mitchell is doing fairly ok though.

Mitchell is gaining some low frequencies (it isn't as dark in the middle) but the filtered white noise is starting to look a tiny bit better.

Here are the rest up to 90%:

So, void and cluster beat the pants off the other two methods.

Filtered white noise used for this purpose is no good and basically fell completely apart.

Mitchell was decent until the sample density got too high and then it failed. There are some parameters to tune with this algorithm so it's possible that it could do better, but in general the algorithm does poorly for high point densities. As an alternative, above 50% density, you could perhaps invert the colors and treat it as 100-density so that it was always working against < 50% density. Even at 50% density, it isn't that great though, at least with my setup.

shadertoy.com recently got a blue noise texture and luckily the blue noise texture was made with the void and cluster algorithm, so it's "the good stuff".

Another family of algorithms to generate blue noise are based on constrained Voronoi diagrams and relaxation to evolve starting sample points to be more blue. Those are good for generating specific point sets for a specific density, but differ from void and cluster, which is designed to make a texture that works well for any density.

There are other algorithms out there as well with different properties, and new ones coming out all the time. SIGGRAPH is starting right now and I bet at least one or two new blue noise algorithms are shown.

Have any interesting blue noise info to share? I'd love to hear it! It feels like the rabbit hole here is a lot deeper than it seems.

I started with some blue noise textures made with the void and cluster algorithm that you can download here: http://momentsingraphics.de/?p=127

I took 16×16, 32×32, 64×64, 128×128 and 256×256 textures and tiled each of those to make 512×512 images.
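Tiling like this is just a modulus on the pixel coordinates. Here is a minimal sketch for a square, single channel image stored as a flat array. The function name and signature are made up for this example and are not from the code that goes with this post.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Repeats a srcSize x srcSize image to fill a dstSize x dstSize image.
std::vector<uint8_t> TileImage(const std::vector<uint8_t>& src, size_t srcSize, size_t dstSize)
{
    std::vector<uint8_t> dst(dstSize * dstSize);
    for (size_t y = 0; y < dstSize; ++y)
    {
        for (size_t x = 0; x < dstSize; ++x)
        {
            // wrap the destination coordinate back into the source image
            dst[y * dstSize + x] = src[(y % srcSize) * srcSize + (x % srcSize)];
        }
    }
    return dst;
}
```

Calling it with a source size of 16 and a destination size of 512 would repeat the 16×16 texture 32 times in each direction.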

Here are the source RGBA images:

Here are the output images, where I used only the R channel of each image.

16×16:

32×32:

64×64:

128×128:

256×256:

Something interesting to note is that blue noise is supposed to tile well by nature. It is made up of high frequencies only, which means there aren’t low frequency patterns that show up and are visible to the eye.

Here’s an interesting read showing how that can be used to make textures (art) that tile better:

https://www.gamasutra.com/view/feature/131482/the_power_of_the_high_pass_filter.php

The c++ code that generated these images can be found on github at https://github.com/Atrix256/U8HDRPMA

I was implementing Inigo Quilez’ “Better Fog” which is REALLY REALLY cool. It looks way better than even the screenshots he has on his page, especially if you have multiple types of fog (distance fog, height fog, fog volumes):

http://www.iquilezles.org/www/articles/fog/fog.htm

I first had it implemented as a forward render, so was doing the fogging in the regular mesh rendering shader, with all calculations being done in 32 bit floats, writing out the final result to a RGBAU8 buffer. Things looked great and it was good.

I then decided I wanted to ray march the fog and get some light shafts in, so it now became a case where I had a RGBAU8 color render target, and I had the depth buffer that I could read to know pixel world position and apply fog etc.

The result was that I had an HDR fog color (it had color components greater than 1 from being “fake lit”) and I knew how opaque the fog was, so I just needed to lerp the existing pixel color to the HDR fog color by the opacity. The usual alpha blending equation (the “over” operator) is actually a lerp so I tried to use it as one.

Source Blend: Source Alpha

Dest Blend: 1 – Source Alpha

Operation: Add

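That becomes this, which is the same as a lerp from DestColor to SrcColor using a lerp amount of SourceAlpha:

$$\text{Result} = \text{SrcColor} \cdot \text{SourceAlpha} + \text{DestColor} \cdot (1 - \text{SourceAlpha})$$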

BAM, that’s when the problem hit. My image looked very wrong, but only where the fog was thickest and brightest. I was thinking maybe it was how I was integrating my fog but it wasn’t. So maybe it was an sRGB thing, but it wasn’t. Maybe it was how I was reconstructing my world position or pixel ray direction due to numerical issues? It wasn’t.

This went on and on until I realized: You can’t say “alpha blend (1.4, 0.3, 2.4) against the color in the U8 buffer using an alpha value of 0.5”. The HDR color is clamped before the alpha blend and you get the wrong result.

You can’t alpha blend an HDR color into a U8 buffer!

… or can you?!

As it turns out, premultiplied alpha came to the rescue here, but let’s look at why. As we go, we are going to be modifying this image:

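Mathematically speaking, alpha blending works like this:

$$\text{Result} = \text{DestColor} \cdot (1 - \text{SourceAlpha}) + \text{SrcColor} \cdot \text{SourceAlpha}$$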

Using the X axis as alpha, and an overlaid solid color of (1.6, 1.4, 0.8), that gives us this:

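However, if you output a float4 from your shader that is (SrcColor, SourceAlpha), alpha blending works like the below, where the clamp limits values to be between 0 and 1:

$$\text{Result} = \text{DestColor} \cdot (1 - \text{SourceAlpha}) + \operatorname{clamp}(\text{SrcColor}) \cdot \text{SourceAlpha}$$

So what happens is that SrcColor gets clamped to be between 0 and 1 before the lerp happens, which makes the result much different: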

However, using pre-multiplied alpha changes things. The float4 we return from the shader is now (SrcColor * SourceAlpha, SourceAlpha).

Our blend operations are now:

Source Blend: One

Dest Blend: 1 – Source Alpha

Operation: Add

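That makes the blending equation become this:

$$\text{Result} = \text{DestColor} \cdot (1 - \text{SourceAlpha}) + \operatorname{clamp}(\text{SrcColor} \cdot \text{SourceAlpha})$$

The clamp changed to encompass the whole second term, instead of just SrcColor! That gives this result that matches the one we got when we did the lerp in shader code: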

So visually things look fine, but let’s look real quick at the math involved.

If you lerp from 0.5 to 10.0 with a lerp factor of 0.2, you’d get 2.4. The equation for that looks like this:

$$0.5 \cdot (1 - 0.2) + 10.0 \cdot 0.2 = 0.4 + 2.0 = 2.4$$

This is what happens when doing the math in the forward rendered shader. You then write it out to a U8 buffer, which clips it and writes out a 1.0.

If you use alpha blending, it clamps the 10 to 1.0 before doing the lerp, which means that it lerps from 0.5 to 1.0 with a lerp factor of 0.2. That gives you a result of 0.6 which is VERY incorrect. This is why the HDR color blending to the U8 buffer didn’t work.

If you use premultiplied alpha blending instead, it clamps the 10.0*0.2 to 1, which means that it was 2 but becomes 1, and the result becomes 1.4. That gets clipped to 1.0 so gives you the same result as when doing it during the forward rendering, but allowing you to do it during a second pass.

This doesn’t just work for these examples or some of the time, it actually works for all inputs, all of the time. The reason for that is, the second term of the lerp is clipped to 0 to 1 and is added to the first term which is always correct. Both terms are always positive. That means that the second term can add the full range of available values (0 to 1) to the first term, and it is correct within that range. That means this technique will either give you the right answer or clip, but will only clip when it is supposed to anyways.

While I found this useful in a pinch, it’s worth noting that you may just want to use an HDR format buffer for doing this work instead of working in a U8 buffer. The reason why is even though this gives the same answer as doing the work in the shader code, BOTH implementations clip. That is… both implementations SHOULD be writing out values larger than 1.0 but the colors are clamped to being <= 1.0. This is important because if you are doing HDR lit fog (and similar), you probably want to do some sort of tone mapping to remap HDR colors to SDR colors, and once your colors clip, you've lost information that you need to do that remapping.

The red pixels below show where clipping happens:

Adding depth of field (and bokeh) can make an image that looks like this:

The first image is rendered using an impossibly perfect pinhole camera (which is what we usually do in roughly real time graphics, in both rasterization and ray based rendering), and the second image is rendered using a simulated lens camera. This post is meant to explain everything you need to know to go from image 1 to image 2.

There is also a link to the code at the bottom of the post.

We are going to start off by looking at pinhole cameras – which can in fact have Bokeh too! – and then look at lens cameras.

If you don’t yet know path tracing basics enough to generate something like the first image, here are some great introductions:

- https://blog.demofox.org/2016/09/21/path-tracing-getting-started-with-diffuse-and-emissive/
- http://simonstechblog.blogspot.com/2018/06/simple-gpu-path-tracer.html
- https://aras-p.info/blog/2018/03/28/Daily-Pathtracer-Part-0-Intro/
- http://www.kevinbeason.com/smallpt/
- https://github.com/rorydriscoll/RayTracer

A pinhole camera is a box with a small hole – called an aperture – that lets light in. The light goes through the hole and hits a place on the back of the box called the “sensor plane” where you would have film or digital light sensors.

The idea is that the aperture is so small that each sensor has light hitting it from only one direction. When this is true, you have a perfectly sharp image of what’s in front of the camera. The image is flipped horizontally and vertically and is also significantly dimmer, but it’s perfectly sharp and in focus.

As you might imagine, a perfect pinhole camera as described can’t actually exist. The size of the hole is larger than a single photon, the thickness of the material is greater than infinitesimally small, and there are also diffraction effects that bend light as it goes through.

These real world imperfections make it so an individual sensor will get light from more than one direction through the aperture, making it blurrier and out of focus.

Reality is pretty forgiving though. Pinhole cameras that give decent results can be made easily, even with simple materials lying around the house (http://www.instructables.com/id/How-To-Make-A-Pinhole-Camera/).

You can even go deeper and make your own fairly high quality pinhole camera if you want: https://www.diyphotography.net/the-comprehensive-tech-guide-to-pinhole-photography/

As far as aperture size goes, the smaller the aperture, the sharper the image. The larger the aperture, the blurrier the image. However, smaller apertures also let in less light so are dimmer.

This is why if you’ve ever seen a pinhole camera exhibit at a museum, they are always in very dark rooms. That lets a smaller aperture hole be used, giving a sharper and more impressive result.

When using a pinhole camera with film, if you wanted a sharp image that was also bright, you could make this happen by exposing the film to light for a longer period of time. This longer exposure time lets more light hit the film, resulting in a brighter image. You can also decrease the exposure time to make a less bright image.

Real film has non linear reaction to different wavelengths of light, but in the context of rendered images, we can just multiply the resulting colors by a value as a post effect process (so, you can adjust it without needing to re-render the image with different exposure values!). A multiplier between 0 and 1 makes the image darker, while a multiplier greater than 1 makes the image brighter.

It’s important to note that with a real camera, longer exposure times will also result in more motion blur. To counteract this effect, you can get film that reacts more quickly or more slowly to light. This lets you have the aperture size you want for the desired sharpness level, while having the exposure time you want for the desired motion blur, while still having the desired brightness, due to the film’s ISO (film speed).

For a much deeper dive on these concepts, here is a really good read:

https://www.cambridgeincolour.com/tutorials/camera-exposure.htm

While aperture size matters, so does shape. When things are out of focus, they end up taking the shape of the aperture. Usually the aperture is shaped like something simple, such as a circle or a hexagon, but you can exploit this property to make for some really exotic bokeh effects. The image at the top of this post used a star of David shaped aperture for instance and this image below uses a heart shape.

Here are two articles that talk about how to make your own bokeh mask for custom bokeh shapes for physical cameras:

https://photorec.tv/2017/02/diy-heart-shaped-bokeh-valentines-day/ (The image above is from this article!)

https://www.diyphotography.net/diy_create_your_own_bokeh/

Ultimately what is happening is convolution between the aperture and the light coming in. When something is in focus, the area of convolution is very small (and not noticeable). As it gets out of focus, it gets larger.

The last property I wanted to talk about is focal length. Adjusting focal length is just moving the sensor plane to be closer or farther away from the aperture. Adjusting the focal length gives counter intuitive results. The smaller the focal length (the closer the sensor plane is to the aperture), the smaller the objects appear. Conversely, the larger the focal length (the farther the sensor plane is from the aperture), the larger the objects appear.

The reason for this is because as the sensor plane gets closer, the field of view increases (the sensor can see a wider angle of stuff), and as it gets farther, the field of view decreases. It makes sense if you think about it a bit!

In the below, focal length and aperture radius are in “World Units”. For reference, the red sphere is 3 world units in radius. The path traced image is multiplied by an exposure multiplier before being shown on the screen and is only a post effect, meaning you can change the exposure without having to re-render the scene, since it’s just a color multiplier.

Here is a video showing how changing focal length affects the image. It ranges from 0.5 to 5.0. Wayne’s world, party time, excellent!

These next three images show how changing the aperture size affects brightness. This first image has an aperture size of 0.01 and an exposure of 3000.

This second image has an aperture size of 0.001 and the same exposure amount, making it a lot sharper, but also much darker.

This third image also has an aperture size of 0.001, but an exposure of 300,000. That makes it have the same brightness as the first image, but the same sharpness as the second image.

If you are wondering how to calculate how much exposure you need to get the same brightness with one aperture radius as another, it’s not too difficult. The amount of light coming through the aperture (aka the brightness) is multiplied by the area of the aperture.

When using a circular aperture, we can remember that the area of a circle is $\pi r^2$.

So, let’s say you were changing from a radius 10 aperture to a radius 5 aperture. The radius 10 circle has an area of $100\pi$, and the radius 5 circle has an area of $25\pi$. That means that the radius 5 circle has 1/4 the area that the radius 10 circle does, which means you need to multiply your exposure by 4 to get the same brightness.

In the case of moving from radius 0.01 to 0.001, we are making the brightness be 1/100 of what it was, so we multiply the 3,000 by 100 to get the exposure value of 300,000.

Here is a video showing how aperture radius affects the sharpness of the image. The exposure is automatically adjusted to preserve brightness. Aperture radius ranges from 0.001 to 0.2.

In the next section we’ll talk about how to make different aperture shapes actually function, but as far as brightness and exposure goes, it’s the same story. You just need to be able to calculate the area of whatever shape (at whatever size it is) that you are using for your aperture shape. With that info you can calculate how to adjust the exposure when adjusting the aperture size.

Here are some different aperture shapes with roughly the same brightness (I eyeballed it instead of doing the exact math)

Circle:

Gaussian distributed circle:

Star of David:

Triangle:

Square:

Ring:

Even though it’s possible to do bokeh with a pinhole camera as you can see, there is something not so desirable. We get the nice out of focus shapes, but we don’t get any in focus part of the image to contrast it. The reason for this is that pinhole cameras have constant focus over distance. Pinhole camera image sharpness is not affected by an object being closer or farther away.

To get different focus amounts over different distances, we need to use a lens! Before we talk about lenses though, let’s talk about how you’d actually program a pinhole camera as we’ve described it.

With the concepts explained let’s talk about how we’d actually program this.

First you calculate a ray as you normally would for path tracing, where the origin is the camera position, and the direction is the direction of the ray into the world. Adding subpixel jittering for anti aliasing (to integrate over the whole pixel) is fine.

At this point, you have a pinhole camera that has an infinitesimally small aperture. To make a more realistic pinhole camera, we’ll need to calculate a new ray which starts on the sensor plane, and heads towards a random point on the aperture.

Important note: the position of the aperture is the same as the camera position. They are the same point!

**Calculating the Point on the Sensor Plane**

We first find where the ray would hit the sensor plane if it were 1 unit behind the aperture (which will be a negative amount of time). We put that point into camera space, multiply the z of the camera space by the focal length (this moves the sensor plane), and then put it back into world space to get the actual world space origin of the ray, starting at the sensor plane.

To calculate the plane equation for the sensor plane, the normal for that plane is the camera’s forward direction, and a point on that plane is the camera position minus the camera’s forward direction. Calculating the equation for that plane is just:

```hlsl
sensorPlane.xyz = cameraForward;
sensorPlane.w = -dot(cameraForward, (cameraPos - cameraForward));
```

Note that xyzw are ABCD in the plane equation $Ax + By + Cz + D = 0$.

You can then do this to find the point where the ray hits the sensor plane:

```hlsl
float t = -(dot(cameraPos, sensorPlane.xyz) + sensorPlane.w) / dot(rayDirection, sensorPlane.xyz);
sensorPos = cameraPos + rayDirection * t;
```

From there, you do this to adjust the focal length and to get the world space starting position of the ray:

```hlsl
// convert the sensorPos from world space to camera space
float3 cameraSpaceSensorPos = mul(float4(sensorPos, 1.0f), viewMtx).xyz;

// elongate z by the focal length
cameraSpaceSensorPos.z *= DOFFocalLength;

// convert back into world space
sensorPos = mul(float4(cameraSpaceSensorPos, 1.0f), invViewMtx).xyz;
```

Now we know where the ray starts, but we need to know what direction it’s heading in still.

**Calculating the Random Point on the Aperture**

Now that we have the point on the sensor, we need to find a random point on the aperture to shoot the ray at.

To do that, we first calculate a uniform random point in a circle with radius “ApertureRadius”, since the aperture is a circle. Here is some code that does that (RandomFloat01() returns a random floating point number between 0 and 1):

```hlsl
float angle = RandomFloat01(state) * 2.0f * c_pi;
float radius = sqrt(RandomFloat01(state));
float2 offset = float2(cos(angle), sin(angle)) * radius * ApertureRadius;
```

If you wanted different shaped apertures for different shaped bokeh, you are only limited to whatever shapes you can generate uniformly random points on.
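As a rough illustration of that point, here is a CPU-side C++ sketch of a few ways to get uniform random points for other shapes. None of this is taken from the path tracer: `Float2`, `SampleSquare`, `SampleRing` and `SampleByRejection` are hypothetical names, and `RandomFloat01` here is a stand-in built on `std::mt19937` rather than the shader’s version. All shapes fit in the unit square from -1 to 1, so you would still multiply the result by the aperture radius.

```cpp
#include <cmath>
#include <random>

struct Float2 { float x, y; };

// Stand-in for the post's RandomFloat01(): a uniform random float between 0 and 1.
inline float RandomFloat01(std::mt19937& rng)
{
    return std::uniform_real_distribution<float>(0.0f, 1.0f)(rng);
}

// Square aperture: x and y are independent uniform values in [-1, 1].
inline Float2 SampleSquare(std::mt19937& rng)
{
    return { RandomFloat01(rng) * 2.0f - 1.0f, RandomFloat01(rng) * 2.0f - 1.0f };
}

// Ring aperture: like the circle, but the radius is limited to [innerRadius, 1].
// Interpolating between the SQUARED radii before the sqrt keeps the points
// uniform over the ring's area.
inline Float2 SampleRing(std::mt19937& rng, float innerRadius)
{
    const float c_pi = 3.14159265359f;
    float angle = RandomFloat01(rng) * 2.0f * c_pi;
    float innerSq = innerRadius * innerRadius;
    float radius = std::sqrt(innerSq + RandomFloat01(rng) * (1.0f - innerSq));
    return { std::cos(angle) * radius, std::sin(angle) * radius };
}

// Any shape with an inside / outside test: keep picking points in the
// bounding square until one lands inside the shape (rejection sampling).
template <typename InsideFn>
inline Float2 SampleByRejection(std::mt19937& rng, InsideFn inside)
{
    for (;;)
    {
        Float2 p = SampleSquare(rng);
        if (inside(p))
            return p;
    }
}
```

Rejection sampling is the easy fallback for shapes like stars, where a direct mapping is awkward, at the cost of wasting the samples that miss. The inside test must accept a region with non-zero area or the loop never ends.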

If we add that random offset to the camera position in camera space (multiply offset.x by the camera’s x axis, and offset.y by the camera’s y axis and add those to the camera position), that gives us a random point on the aperture. This is where we want to shoot the ray towards.

```hlsl
rayOrigin = sensorPlanePosition;
rayDirection = normalize(randomAperturePosition - sensorPlanePosition);
```

You can now use this ray to have a more realistic pinhole camera!
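Putting the pieces together, here is a minimal C++ sketch of the whole pinhole ray setup. To stay short it works entirely in camera space, which is an assumption of this sketch and not how the shader snippets above are written: the aperture center (camera position) is at the origin, the camera looks down +z, and x and y are the camera’s right and up axes. `Float3`, `Ray` and `PinholeCameraRay` are hypothetical names.

```cpp
#include <cmath>

struct Float3 { float x, y, z; };
struct Ray { Float3 origin; Float3 direction; };

inline Float3 Normalize(Float3 v)
{
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x / len, v.y / len, v.z / len };
}

// pixelDir is the usual pinhole ray direction for the pixel (pixelDir.z > 0).
// apertureOffsetX / Y is a random point on the aperture, already scaled by
// the aperture radius.
inline Ray PinholeCameraRay(Float3 pixelDir, float focalLength,
                            float apertureOffsetX, float apertureOffsetY)
{
    // Follow the ray backwards until it is 1 unit behind the aperture.
    // This is the plane intersection from above, which gives a negative t.
    float t = -1.0f / pixelDir.z;
    Float3 sensorPos = { pixelDir.x * t, pixelDir.y * t, pixelDir.z * t };

    // Move the sensor plane by scaling z by the focal length.
    sensorPos.z *= focalLength;

    // The random point on the aperture lies in the z = 0 plane.
    Float3 aperturePos = { apertureOffsetX, apertureOffsetY, 0.0f };

    Ray ray;
    ray.origin = sensorPos;
    ray.direction = Normalize({ aperturePos.x - sensorPos.x,
                                aperturePos.y - sensorPos.y,
                                aperturePos.z - sensorPos.z });
    return ray;
}
```

A quick sanity check: with a focal length of 1 and an aperture offset of zero, the returned direction is the same as the original `pixelDir`, so you are back to the perfect pinhole camera.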

**Brightness**

If you want to be more physically correct, you would also multiply the result of your raytrace into the scene by the area of the aperture. This is the correct way to do monte carlo integration over the aperture (more info on monte carlo basics: https://blog.demofox.org/2018/06/12/monte-carlo-integration-explanation-in-1d/), but the intuitive explanation here is that a bigger hole lets in more light.

After you do that, you may find that you want to be able to adjust the aperture without affecting brightness, so then you’d go through the math I talked about earlier, and you’d auto calculate exposure based on aperture size.

When looking at the bigger picture of that setup, you’d be multiplying a number to account for aperture size, then you’d basically be dividing by that number to make it have the desired brightness – with a little extra to make it a little bit darker or brighter as the baseline brightness.

A more efficient way to do this would be to just not multiply by the aperture area, and apply an exposure to that result. That way, instead of doing something like dividing by 300,000 and then multiplying by 450,000, you would just multiply by 1.5, and it’d be easier for a human to work with.

Finally, onto lenses!

The simplest lens camera that you can make (and what I used) is to just put a convex lens inside the aperture.

Funny tangent: lens comes from the Greek word for lentil. (https://jakubmarian.com/are-lens-and-lentil-related/)

A motivation for using lenses is that unlike pinhole cameras, you can increase the aperture size to let more light in, but still get a focused shot.

This comes at a cost though: there is a specific range of depth that is in focus. Other things that are too close or too far will appear blurry. Also, the larger the aperture, the smaller the “in focus range” will be.

From that perspective, it feels a bit silly simulating lenses in computer graphics, because there is no technical reason to simulate a lens. In computer graphics, it’s easier to make a sharper image than a blurry one, and if we want to adjust the image brightness, we just multiply the pixels by a constant.

Simulating a lens for depth of field and bokeh is purely a stylistic choice, and has nothing to do with a rendering being more correct!

**How Convex Lenses Work**

Convex lenses are also called converging lenses because they bend incoming light inwards to cross paths. Below is a diagram showing how the light travels from objects on the left side, through the lens, to the right side. The light meets on the other side of the lens at a focus point for each object. The orange “F” labels show the focal distance of the lens.

If two points are the same distance from the lens on the axis perpendicular to the lens, their focal points will also be the same distance from the lens on that axis, on the other side of the lens.

This means that if we had a camera with a sensor plane looking through a lens, that there would be a focal PLANE on the other side of the lens, made up of the focus points of each point for each sensor on the sensor plane. Things closer than the focus plane would be blurry, and things farther than the focus plane would be blurry, but things near the focus plane would be sharper.

The distance from the camera (aperture) to the focal plane is based on the focal distance of the lens, and also how far back the sensor plane is. Once you have those two values, you could calculate where the focal plane is.

There is a simpler way though for us. We can skip the middle man and just define the distance from the camera to the focal plane, pretending like we calculated it from the other values.

This is also a more intuitive setting because it literally tells you where an object has to be to be in focus. It has to be that many units from the camera to be perfectly in focus.

Going this route doesn’t make our renderer any less accurate, it just makes it easier to work with.

Nathan Reed (@Reedbeta) has this information to add, to clarify how focus works on lens cameras (Thanks!):

The thing you change when you adjust focus on your camera is the “image distance”, how far the aperture is from the film, which should be greater than or equal to the lens focal length.

The farther the aperture from the sensor, the nearer the focal plane, and vice versa. 1/i + 1/o = 1/f.

And this good info too:

“focal length” of a lens is the distance from film plane at which infinite depth is in sharp focus, and is a property of the lens, eg “18mm lens”, “55mm lens” etc. The focal length to sensor size ratio controls the FOV: longer lens = narrower FOV

Programming a lens camera is pretty simple:

- Calculate a ray like you normally would for a path tracer: the origin is the camera position, and the direction is pointed out into the world. Subpixel jitter is again just fine to mix with this.
- Find where this ray hits the focal plane. This is the focal point for this ray
- Pick a uniform random spot on the aperture
- Shoot the ray from the random aperture position to the focal point.

That’s all there is to it!
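Here is a minimal C++ sketch of those four steps. As a simplifying assumption it works in camera space (aperture center at the origin, camera looking down +z), and it treats the focal plane as the flat plane z = focalPlaneDistance. `Float3`, `Ray` and `LensCameraRay` are hypothetical names, not the ones used in the path tracer.

```cpp
#include <cmath>

struct Float3 { float x, y, z; };
struct Ray { Float3 origin; Float3 direction; };

inline Float3 Normalize(Float3 v)
{
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x / len, v.y / len, v.z / len };
}

// pixelDir is the usual pinhole ray direction for the pixel (pixelDir.z > 0).
// apertureOffsetX / Y is a uniform random point on the aperture, already
// scaled by the aperture radius.
inline Ray LensCameraRay(Float3 pixelDir, float focalPlaneDistance,
                         float apertureOffsetX, float apertureOffsetY)
{
    // Find where the pinhole ray hits the focal plane.
    // This is the focal point for this ray.
    float t = focalPlaneDistance / pixelDir.z;
    Float3 focalPoint = { pixelDir.x * t, pixelDir.y * t, focalPlaneDistance };

    // The random spot on the aperture lies in the z = 0 plane.
    Float3 aperturePos = { apertureOffsetX, apertureOffsetY, 0.0f };

    // Shoot the ray from the aperture position to the focal point.
    Ray ray;
    ray.origin = aperturePos;
    ray.direction = Normalize({ focalPoint.x - aperturePos.x,
                                focalPoint.y - aperturePos.y,
                                focalPoint.z - aperturePos.z });
    return ray;
}
```

Every ray generated for a pixel passes through the same focal point no matter which aperture position was picked, which is why objects at the focal plane stay sharp while everything else gets smeared over the aperture shape.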

You could go through a more complex simulation where you shoot a ray from the sensor position to a random spot on the aperture, calculate the refraction ray, and shoot that ray into the world, but you’d come up with the same result.

Doing it the way I described makes it no less accurate(*)(**), but is simpler and computationally less expensive.

* You’ll notice that changing the distance to the focal plane doesn’t affect FOV like changing the focal distance did for the pinhole camera. If you did the “full simulation” it would.

** Ok technically this is a “thin lens approximation”, so isn’t quite as accurate but it is pretty close for most uses. A more realistic lens would also have chromatic aberration and other things so ::shrug::

You can optionally multiply the result of the ray trace by the aperture area like we mentioned in the pinhole camera to make the brightness be properly affected by aperture size. If you’d rather not fight with exposure multiplier calculations as you change aperture size though, feel free to leave it out.

Here are some links for more information on lenses:

http://www.physicsclassroom.com/class/refrn/Lesson-5/Converging-Lenses-Ray-Diagrams

https://computergraphics.stackexchange.com/questions/4344/depth-of-field-in-path-tracing-what-do-i-do-with-the-secondary-ray

https://en.wikipedia.org/wiki/Thin_lens

http://www.passmyexams.co.uk/GCSE/physics/concave-lenses-convex-lenses.html

This video shows the effect of the aperture size changing. Notice that the area in focus is smaller with a larger aperture radius.

This video shows the effect of the focal distance changing. Nothing too surprising here, it just changes what depth is in focus.

Even after I had things implemented correctly, I was having trouble understanding how to set the parameters to get good Bokeh shots, as you can see from my early images below:

Luckily, @romainguy clued me in: “Longer focals, wider apertures, larger distance separation between subjects”

So what I was missing is that the bokeh is the stuff in the background, which you make out of focus, and you put the focal plane at the foreground objects you want in focus.

It’s a bit strange when you’ve implemented something and then need to go ask folks skilled in another skill set how to use what you’ve made hehe.

Here’s some other links I found useful while implementing the code and writing this post:

https://en.wikipedia.org/wiki/Pinhole_camera_model#The_geometry_and_mathematics_of_the_pinhole_camera

https://en.wikipedia.org/wiki/Camera_lens#Theory_of_operation

https://www.scratchapixel.com/lessons/3d-basic-rendering/3d-viewing-pinhole-camera/virtual-pinhole-camera-model

https://en.m.wikipedia.org/wiki/Circle_of_confusion

My GPU path tracer that generated these images is up on github.

It’s a work in progress so is missing some things, has some todo notes, and maybe has some things that are incorrect in it, be warned!

The code is here: https://github.com/Atrix256/FalcorPathTracer/releases/tag/v1.0

The path tracer uses nvidia’s “Falcor” api abstraction layer. As best as I can tell, just pulling down falcor to your machine and compiling it registers it *somewhere* such that projects that depend on falcor can find it. I’m not really sure how that works, but that worked for me on a couple machines I tried it on strangely.

This is the version / commit of Falcor I used:

https://github.com/NVIDIAGameWorks/Falcor/commit/0b561caae19e8325853166cc4c93d4763570774a

I wish I had a more fool proof way to share the code – like if it were to download the right version of falcor when you try to build it. AFAIK there isn’t a better way, but if there is I would love to hear about it.

Anyhow, happy rendering!!

We could solve this specific problem by doing some algebra and calculus to get the exact answer analytically (which is $\frac{\pi}{2}$), but let’s pretend like we can’t, or don’t want to solve it that way.

Another way to solve this problem is to use Monte Carlo integration, which lets you solve it numerically and get an approximated answer.

How you would do that is like this:

- Pick a random number between 0 and pi.
- Plug that value into the function as x to get a y value.
- Do this multiple times and take the average to get the average y value of the function.
- Pretending that the function is a rectangle, you can use the average y as the height of the rectangle, and use pi as the width because we are looking between 0 and pi.
- Multiply that width and height to get the area of a rectangle, which is the estimated area under the curve.

That’s all you need to do!

Monte Carlo integration is pretty powerful in how simple it is, and how it works really well even in extremely high dimensions.

As you might imagine, the more samples you take to get your average y value, the better your estimate is going to be. Unfortunately though, you have to quadruple the number of samples you have to cut the error in half, so it can take a while to get the correct answer (converge) if you need a high level of accuracy. (https://en.wikipedia.org/wiki/Monte_Carlo_method#Integration)

Here’s a C++ code snippet doing this process with 10,000 samples. Each time you run the program you’ll get a different estimate. If you take more samples, you’ll more reliably get a better answer.

```cpp
#include <cmath>
#include <random>

double SimpleMonteCarlo()
{
    double rangeMin = 0;
    double rangeMax = 3.14159265359;

    size_t numSamples = 10000;

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> dist(rangeMin, rangeMax);

    // sum up the y values at uniform random x values
    double ySum = 0.0;
    for (size_t i = 1; i <= numSamples; ++i)
    {
        double x = dist(mt);
        double y = sin(x)*sin(x);
        ySum += y;
    }
    double yAverage = ySum / double(numSamples);

    // treat the function as a rectangle: width * average height
    double width = rangeMax - rangeMin;
    double height = yAverage;

    return width * height;
}
```

Below is the output of the code run 5 times. Note that the real answer is $\frac{\pi}{2}$, which is 1.57079632679.

- 1.548451
- 1.554312
- 1.576727
- 1.578759
- 1.598686

(I’m actually a bit disturbed that the 5 runs are actually sorted from low to high but whatever …)

A problem with this being based on regular old random numbers (white noise) is that sometimes the numbers will clump, giving too much weighting to one area of the function, and leave empty space where another part of the function wasn’t sampled at all.

There are many different ways to deal with this situation but two of my favorites are…

- Blue Noise: https://blog.demofox.org/2018/01/30/what-the-heck-is-blue-noise/
- Low discrepancy sequences: https://blog.demofox.org/2017/05/29/when-random-numbers-are-too-random-low-discrepancy-sequences/

Both of those things give more even coverage over the sampling space which means that you won’t have as large gaps of missing information from your samples.

Another way to help this is stratified sampling, where you break the sampling space up into some number of sections, and choose random numbers within each section, making sure to have samples in each of the sections. That keeps the randomness, but gives more even coverage over the sampling space.
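As a sketch of what stratified sampling could look like for the same integral (sin(x)^2 from 0 to pi), the version below uses one section per sample and picks a random position inside each section. The function name and the choice of one sample per section are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Estimates the integral of sin(x)^2 from 0 to pi using stratified sampling:
// the range is split into numSamples equal sections, and one uniform random
// sample is taken inside each section.
inline double StratifiedMonteCarlo(std::size_t numSamples, std::mt19937& rng)
{
    const double rangeMin = 0.0;
    const double rangeMax = 3.14159265359;
    const double width = rangeMax - rangeMin;

    std::uniform_real_distribution<double> dist(0.0, 1.0);

    double ySum = 0.0;
    for (std::size_t i = 0; i < numSamples; ++i)
    {
        // (i + random) / numSamples is a random position inside section i
        double x = rangeMin + width * (double(i) + dist(rng)) / double(numSamples);
        ySum += std::sin(x) * std::sin(x);
    }

    // same "average height times width" estimate as before
    return width * ySum / double(numSamples);
}
```

Because every section is guaranteed to get a sample, the clumps and gaps of plain white noise can’t happen, and for a smooth function like this one the estimate tends to land much closer to the real answer for the same number of samples.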

You might be tempted to just say “If I’m taking 100 samples, I’ll just sample every 1/100th of the space evenly”. That uniform / regular sampling has some problems including aliasing, but also loses some of the positive mathematical properties that random numbers can give you (like, being able to sample from non rational numbered locations!).

A variation on stratified sampling is a technique invented by Pixar called “jittered grid” where you do even sampling, but add a small random value to each sample.

There are lots and lots of other techniques which could make up a long list of blog posts, so we’ll stop there!

The last section was actually a simplified version of a Monte Carlo integration which was able to be simplified because it was using uniform random numbers.

Monte Carlo integration works with random numbers that have arbitrary distributions as well, not just uniform random numbers.

The process works mostly the same but there are a couple differences.

In the previous section, we got an average height and then multiplied by the width to get an estimate of the area under the curve, pretending that it was a rectangle.

The first change is to move the multiplication by the width into the loop. Instead of calculating an average height, we are instead calculating average rectangle areas.

Mathematically you get the same answer, so there’s nothing crazy there.

The second change is that instead of multiplying by the width, you divide by the probability of the number being chosen, that you plugged into the equation.

In the case of our function that we are taking samples of between 0 and pi, the probability of any single number being chosen in that range is $\frac{1}{\pi}$. When we divide by that, it means we end up just multiplying by pi, so it’s mathematically equivalent to what we were doing before!

Here are the steps for the more generalized monte carlo integration:

- Pick a random number between 0 and pi using any random number distribution you’d like to.
- Plug that value into the function as x to get a y value.
- Divide that y value by the probability of having chosen that number (otherwise known as PDF(x)) to get an estimated area of the function.
- Do this multiple times and take the average to get your result.

Here is some code to do the more general Monte Carlo integration, still using uniformly distributed random numbers.

```cpp
#include <cmath>
#include <random>

static const double c_pi = 3.14159265359;

double GeneralMonteCarlo()
{
    size_t numSamples = 10000;

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    // uniform distribution over [0, pi]
    auto InverseCDF = [](double x) -> double
    {
        return x * c_pi;
    };

    auto PDF = [](double x) -> double
    {
        return 1.0 / c_pi;
    };

    double estimateSum = 0.0;
    for (size_t i = 1; i <= numSamples; ++i)
    {
        double rnd = dist(mt);
        double x = InverseCDF(rnd);
        double y = sin(x)*sin(x);
        double pdf = PDF(x);
        double estimate = y / pdf;

        estimateSum += estimate;
    }
    double estimateAverage = estimateSum / double(numSamples);

    return estimateAverage;
}
```

Interestingly, dividing by the PDF is the same mathematically as multiplying by width in the last section – it literally ends up being a multiplication by pi (the width). The only difference is that we pulled the multiply into the loop, instead of leaving it until the end.

As an optimization, you could definitely move the divide out again (and turn it into a multiply), but I wanted to present the code as close to the core concepts as possible.

Let’s try sampling from a different random number distribution. Let’s generate random numbers which have a distribution of $y=\sin(x)$. You can see it compared to the function we are integrating below. They are fairly similarly shaped!

To use $y=\sin(x)$ as a random number distribution for monte carlo integration, we’ll need to calculate the normalized PDF and we’ll also need to calculate the inverse CDF.

If you want to know more about PDFs and “whatever an inverse CDF may be”, give this a read: Generating Random Numbers From a Specific Distribution By Inverting the CDF

- The function $y=\sin(x)$ is normalized to this PDF: $PDF(x) = \frac{\sin(x)}{2}$
- To generate numbers from that PDF, you take a random number that is between 0 and 1 and plug it into this function, which is the inverse CDF: $CDF^{-1}(x) = 2 \sin^{-1}(\sqrt{x})$
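For reference, here is one way to derive the PDF and inverse CDF that the code below uses ($\sin(x)/2$ and $2\sin^{-1}(\sqrt{x})$), assuming the distribution is $\sin(x)$ restricted to the range 0 to $\pi$. Here $u$ is the uniform random number between 0 and 1.

```latex
\begin{aligned}
\int_0^\pi \sin(x)\,dx &= \left[-\cos(x)\right]_0^\pi = 2
  \quad\Rightarrow\quad PDF(x) = \frac{\sin(x)}{2} \\
CDF(x) &= \int_0^x \frac{\sin(t)}{2}\,dt = \frac{1-\cos(x)}{2}
  = \sin^2\!\left(\frac{x}{2}\right) \\
u &= \sin^2\!\left(\frac{x}{2}\right)
  \quad\Rightarrow\quad x = CDF^{-1}(u) = 2\sin^{-1}\!\left(\sqrt{u}\right)
\end{aligned}
```

The first line normalizes the function so it integrates to 1, the second integrates the PDF to get the CDF, and the third solves the CDF for $x$.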

Here is a code snippet doing monte carlo integration with this PDF and inverse CDF:

```cpp
#include <cmath>
#include <random>

double ImportanceSampledMonteCarlo()
{
    size_t numSamples = 10000;

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    // distribution shaped like sin(x) over [0, pi]
    auto InverseCDF = [](double x) -> double
    {
        return 2.0 * asin(sqrt(x));
    };

    auto PDF = [](double x) -> double
    {
        return sin(x) / 2.0;
    };

    double estimateSum = 0.0;
    for (size_t i = 1; i <= numSamples; ++i)
    {
        double rng = dist(mt);
        double x = InverseCDF(rng);
        double y = sin(x)*sin(x);
        double pdf = PDF(x);
        double estimate = y / pdf;

        estimateSum += estimate;
    }
    double estimateAverage = estimateSum / double(numSamples);

    return estimateAverage;
}
```

To compare this versus uniform random sampling, I'll show the progress it makes over 50,000,000 samples first using uniform random numbers, then using the shaped PDF.

**Uniform aka 1/pi:**

**sin(x):**

You may notice that every 4x samples, the standard deviation (which is the square root of variance) drops in half, like we talked about before. This is why path tracing takes so long. If you don’t know what path tracing is, this is why modern animated movies take so long to render.

In the results, you can see that the variance of the estimates is a lot lower using this PDF that is shaped more like the function we are trying to integrate. We got a better, more reliable answer with fewer samples. Is that pretty cool? You bet it is! When you use a PDF shaped like the function you are integrating, to get better results faster, that is called importance sampling.

If you use a PDF which is shaped very differently from the function you are trying to integrate, you will get more variance and it will take longer to converge, which is a total bummer.

Let’s try , which doesn’t look much like the function we are trying to integrate at all:

Here is the PDF and inverse CDF:

Here it is with 50,000,000 samples:

And here is the uniform sampling again as a comparison:

As you can see, it is approaching the right answer, but is taking about 10 times as long to get the same results (amount of variance) compared to uniform sampling. Ouch!

Let’s say that we got really lucky and somehow got the PDF and inverse CDF for a function that perfectly matched the function we were trying to integrate. What would happen then?

Let’s check it out by integrating the function by using a random number distribution which has the form .

We already calculated the PDF and inverse CDF of that function earlier:

Here we do that with 50,000,000 samples:

WOW! As you can see, it had the right answer from the first sample, with zero variance (randomness) and it kept steady at that answer for all 50,000,000 samples.
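The reason the variance drops to zero can be shown in a couple of lines. This is a general sketch, assuming a function $f$ that is positive on the range and a PDF that is exactly $f$ divided by its own integral $I$:

```latex
\begin{aligned}
I &= \int_a^b f(x)\,dx, \qquad PDF(x) = \frac{f(x)}{I} \\
\frac{f(x_i)}{PDF(x_i)} &= \frac{f(x_i)}{f(x_i)/I} = I
  \quad \text{for every sample } x_i \\
\frac{1}{N}\sum_{i=1}^{N} \frac{f(x_i)}{PDF(x_i)} &= I
\end{aligned}
```

Every single estimate is already the exact answer, so the average can’t move. The catch is that normalizing the PDF requires knowing $I$, which is the thing we were trying to calculate in the first place.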

This is a pretty neat concept, and if you know about “cosine weighted hemisphere sampling”, that does this exact thing.

Cosine weighted hemisphere samples are weighted such that you can remove the $\cos(\theta)$ from the lighting calculations, because the random number distribution handles it for you.

It basically removes that part of randomness from the equations.

Unfortunately there are more variables and randomness in path tracing than just that term, but it helps.

Beyond this, you’d start looking at other variance reduction techniques if you were interested, including multiple importance sampling.

Going into this blog post I thought “hey no sweat, I’ll make a few simple functions, calculate their PDFs, inverse CDFs and be on my way”.

I can’t believe how almost all the simple functions I tried ended up being impossible to take through the process.

For instance, you can take and solve for y to get , but if you try to solve for y, you are going to have a bad day!

I think in the future if I need to do something like this, I’d like to try fitting a curve to the (x,y) data points reordered as (y,x) data points, but there are many other methods for doing this sort of thing as well.
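As one example of those other methods (this is not the curve fitting idea, and not something used in this post), you can invert a CDF numerically with bisection, which only needs the CDF to never decrease over the range. A minimal sketch, where the function names and the example CDF are assumptions for illustration:

```cpp
#include <cassert>
#include <cmath>

// Example CDF (an assumption for this sketch): the CDF of the PDF sin(x) / 2 on [0, pi]
double ExampleCDF(double x)
{
    return (1.0 - std::cos(x)) / 2.0;
}

// Hypothetical helper: finds the x in [xMin, xMax] where cdf(x) equals u, using bisection.
// This works for any CDF that never decreases, even if it can't be inverted on paper.
template <typename CDF>
double InvertCDFNumerically(CDF cdf, double xMin, double xMax, double u)
{
    assert(xMin < xMax);

    // Each iteration halves the search range, so 60 iterations is plenty for a double
    for (int i = 0; i < 60; ++i)
    {
        double xMid = 0.5 * (xMin + xMax);
        if (cdf(xMid) < u)
            xMin = xMid;
        else
            xMax = xMid;
    }
    return 0.5 * (xMin + xMax);
}
```

For this particular CDF, `InvertCDFNumerically(ExampleCDF, 0.0, 3.14159265359, u)` should agree with the analytic inverse acos(1 - 2u). It costs many CDF evaluations per sample, so it is a tradeoff of speed for convenience.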

BTW, if you are wondering how I was calculating std dev (aka the square root of variance) while integrating: variance is “The average of the squared differences from the mean”. That means that if you know the correct answer of what you are trying to integrate, you can calculate the std dev like this:

```cpp
// Variance is "The average of the squared differences from the mean"
double difference = integration - actualAnswer;
double differenceSquared = difference * difference;
averageDifferenceSquared = Lerp(averageDifferenceSquared, differenceSquared, 1.0 / double(i));
double stdDev = sqrt(averageDifferenceSquared);
```

- integration is the current average estimate (if you have taken 100 samples, it’s the average of the 100 samples)
- actualAnswer is the known right answer
- averageDifferenceSquared is also the variance
- i is the number of samples you have taken, including the current one (aka start at 1, not 0)
- If you are confused about me doing a lerp to calculate an average, give this a read: Incremental Averaging
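To show how those pieces fit together in a sampling loop, here is a minimal sketch. It assumes uniform sampling of sin(x)^2 from 0 to pi (known answer pi/2). The `Lerp` helper and the function name are written for this example, not copied from the post's code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

// Hypothetical helper, written to match how Lerp is used in the snippet above
double Lerp(double a, double b, double t)
{
    return a + (b - a) * t;
}

// Integrates sin(x)^2 over [0, pi] with uniform samples and returns the std dev
// of the running estimate, measured against the known answer of pi / 2.
double StdDevOfUniformEstimate(std::size_t numSamples, unsigned int seed)
{
    assert(numSamples > 0);

    const double pi = 3.14159265359;
    const double actualAnswer = pi / 2.0;

    std::mt19937 mt(seed);
    std::uniform_real_distribution<double> dist(0.0, pi);

    double yAverage = 0.0;
    double averageDifferenceSquared = 0.0;
    for (std::size_t i = 1; i <= numSamples; ++i)
    {
        double x = dist(mt);
        double y = std::sin(x) * std::sin(x);

        // Incremental average of the samples so far, then width * average height
        yAverage = Lerp(yAverage, y, 1.0 / double(i));
        double integration = yAverage * pi;

        // Incremental average of the squared difference from the right answer
        double difference = integration - actualAnswer;
        averageDifferenceSquared = Lerp(averageDifferenceSquared, difference * difference, 1.0 / double(i));
    }
    return std::sqrt(averageDifferenceSquared);
}
```

The returned value should shrink as you raise the sample count, which is the convergence the graphs in this post are showing.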

Hope you enjoyed this write up!

Anders Lindqvist (@anders_breakin) is writing up a blog post explaining Monte Carlo, importance sampling, and multiple importance sampling that you might be interested in if you enjoyed this. Give him a follow, and it’ll be coming out soon.

Also, here is a really nice Twitter thread talking about why importance sampling actually works:

https://twitter.com/Atrix256/status/1003487338633105409

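```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Estimates the integral of sin(x)^2 from 0 to pi using uniform random samples
double SimpleMonteCarlo()
{
    double rangeMin = 0;
    double rangeMax = 3.14159265359;

    size_t numSamples = 10000;

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> dist(rangeMin, rangeMax);

    // Sum up the function's value at each random x
    double ySum = 0.0;
    for (size_t i = 1; i <= numSamples; ++i)
    {
        double x = dist(mt);
        double y = sin(x) * sin(x);
        ySum += y;
    }
    double yAverage = ySum / double(numSamples);

    // The area estimate is a rectangle: the range width times the average height
    double width = rangeMax - rangeMin;
    double height = yAverage;

    return width * height;
}
```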


]]>