Generating Random Numbers From a Specific Distribution With The Metropolis Algorithm (MCMC)

There are ~400 lines of standalone C++ code implementing the main ideas in this post. You can find it at: https://github.com/Atrix256/MetropolisMCMC

In previous posts I showed how to generate random numbers from a specific distribution by using two techniques:

Rejection Sampling: https://blog.demofox.org/2017/08/08/generating-random-numbers-from-a-specific-distribution-with-rejection-sampling/

Inverting the CDF: https://blog.demofox.org/2017/08/05/generating-random-numbers-from-a-specific-distribution-by-inverting-the-cdf/

This post shows how to do it using a Markov Chain Monte Carlo (MCMC) method called the Metropolis algorithm, and also talks about using the technique for numerical integration.

If you want an intro or review to either Markov Chains or Monte Carlo, these two posts can help you out.

Monte Carlo: https://blog.demofox.org/2018/06/12/monte-carlo-integration-explanation-in-1d/

Markov Chains: https://blog.demofox.org/2019/05/11/markov-chain-text-generation/

Overview

The Metropolis algorithm lets you generate random numbers that follow a distribution given by any function y=f(x), where y is (proportional to) the probability of choosing x.

Rejection sampling does this as well, but you have to throw away an unknown number of bad samples before getting each good sample.

Inverting the CDF doesn’t throw out any samples, but it’s limited in the types of distributions it can handle: it can be mathematically complex, or impossible, to find the inverted CDF of a given function analytically.

In these ways, the Metropolis algorithm is the best of both worlds. You can sample from a distribution defined from any function, and you don’t have to throw out any samples while doing it.

Another interesting thing about the Metropolis algorithm is that it can work blindly. It never actually has to KNOW what function describes the probability distribution. If there is a black box you can give an x to get a y, the Metropolis algorithm can generate random numbers from the distribution hidden in that black box.

The Metropolis algorithm also works in any dimension: you can use it with functions like z=f(x,y) or t = f(x,y,z,w,s).

In higher dimensions such as z=f(x,y), the z is the probability of the 2d random number (x,y) being chosen. It sounds weird, but this situation is like rolling two dice where the value one die comes up as affects the probability of the other die. Maybe when the first die is a larger number, the other die becomes more likely to be a larger number too.

As weird as that is, if you can describe the relationship as a z=f(x,y) function, this can generate (x,y) random numbers from that distribution for you.

It’s worth noting that the Metropolis algorithm is a simpler special case of the Metropolis-Hastings algorithm, and these are just two of many Markov Chain Monte Carlo algorithms.

The Metropolis Algorithm

The Metropolis algorithm is pretty simple.

You start with an x value and calculate y, which is just f(x). This is the initial sample, and hopefully it’s a location where y is greater than 0.

To get the next sample, we first need to calculate a candidate sample, and then choose whether to take it or not.

To make a candidate sample, take a small random step from the current x point, either in the negative or positive direction, to get a new x. Calculate the candidate y, which is just f(x) using the new x.

Calculate a value A (for “acceptance value”), which is the candidate y divided by the last y value. This is the probability that you should take the candidate as the current sample; otherwise you repeat the old sample as the current sample. If A is 1 or greater (the candidate is at least as probable), you always accept it.

To make the decision, you just generate a random number between 0 and 1 and accept the new sample if it’s below the A value.

Rinse and repeat for as many samples as you want.

There’s a simple but fully featured implementation in the code that goes with this post (https://github.com/Atrix256/MetropolisMCMC).
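Below is a minimal sketch of the core loop in the same spirit. The function and variable names are mine, not necessarily the repo’s, and it leaves out the repo’s extra features:

```cpp
#include <cmath>
#include <random>
#include <vector>

// Minimal 1D Metropolis sampler sketch. f(x) is the un-normalized pdf.
// Returns sampleCount x values drawn (approximately) from the distribution.
template <typename TFunc>
std::vector<double> MetropolisSample(const TFunc& f, double xStart, double sigma, size_t sampleCount)
{
    std::mt19937 rng{ std::random_device{}() };
    std::normal_distribution<double> stepDist(0.0, sigma);     // random walk steps
    std::uniform_real_distribution<double> uniDist(0.0, 1.0);  // acceptance rolls

    std::vector<double> samples;
    samples.reserve(sampleCount);

    double x = xStart;
    double y = f(x);  // hopefully a location where y > 0

    for (size_t i = 0; i < sampleCount; ++i)
    {
        // candidate sample: a small random step from the current x
        double xCandidate = x + stepDist(rng);
        double yCandidate = f(xCandidate);

        // acceptance value A is the candidate y over the current y.
        // A >= 1 always accepts; A < 1 accepts with probability A.
        double A = yCandidate / y;
        if (uniDist(rng) < A)
        {
            x = xCandidate;
            y = yCandidate;
        }

        // accepted or not, the current x is the next sample
        samples.push_back(x);
    }
    return samples;
}
```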

The x axis of each sample is a random number drawn from the distribution described by y=f(x). If you keep track of the average x value seen, you’ll get an estimate of the expected value of the distribution.

The y axis of each sample isn’t that useful directly. It’s pdf(x), but scaled up by an unknown amount – the normalization constant for the function.

The convergence rate of Metropolis MCMC isn’t as well understood as Monte Carlo integration, since Metropolis has dependent samples (a random walk that knows where it was last step) vs independent samples (a stateless random sample).

In higher dimensions, you just take a random step on each axis, instead of only the x axis. The rest of the algorithm remains the same.

Regarding the random walk, it’s possible to use many different types of random numbers to take a step, but it’s most common to use a normal distribution.
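To illustrate the higher dimensional case (a sketch of my own, not code from the repo), the proposal is just an independent normal step per axis, and the acceptance test is unchanged:

```cpp
#include <array>
#include <random>

// N-dimensional proposal for the random walk: take an independent,
// normally distributed step on each axis. The acceptance test that
// follows is exactly the same as in 1D.
template <size_t N>
std::array<double, N> ProposeStep(const std::array<double, N>& current, double sigma, std::mt19937& rng)
{
    std::normal_distribution<double> stepDist(0.0, sigma);
    std::array<double, N> candidate = current;
    for (double& axis : candidate)
        axis += stepDist(rng);
    return candidate;
}
```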

Metropolis Burn in and 0.234

Metropolis is memoryless in the Markov chain sense: it only knows the current sample, not the full history of the walk. Despite this, it’s common to do a “burn in” with MCMC where you throw out some number of samples before you start keeping them.

This might sound weird, but the reason for it is that there are good places to sample from and bad places to sample from, and this is an attempt to have the random walk find a better place to sample from.
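As a sketch, using the hypothetical MetropolisSample() from earlier, burn in is as simple as generating extra samples up front and discarding them:

```cpp
// Burn in sketch: the discarded samples give the walk a chance to wander
// into a (hopefully) more probable region before we start keeping samples.
size_t burnInCount = 1000;
size_t keepCount = 100000;
std::vector<double> all = MetropolisSample(f, /*xStart*/ 1.0, /*sigma*/ 0.2, burnInCount + keepCount);
std::vector<double> kept(all.begin() + burnInCount, all.end());
```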

For instance, if you were sampling from y=sin(x)*sin(x)*x from 0 to 2 pi, you’d have a bimodal pdf: a smaller hump on the left and a larger hump on the right.

If you started the random walk at 2, you’d be doing a random walk in the left, less probable hump, and it would be hard to break out into the right side, which is more probable.

You should eventually get into the larger hump and stay there longer, but your initial samples may get stuck around the smaller mode and not do a very good job of following the distribution.

While burn in can help situations like these, this is also an example of how the Metropolis algorithm can fail.

It’s worth noting that using a normal distribution for the random walk can help it not get stuck in bad places. If you use a uniform random distribution that can generate numbers between -k and +k, you can get stuck in situations where you need to take a larger than k step to get to a better (more probable) location. If instead you use a normal distribution, the sigma gives you a good idea of how big most of the steps will be, but there is always a greater than zero chance of taking a step of ANY size, which can get you out of a less probable region.

There is a rule of thumb that the step size you use (the sigma in the case of a normal distribution) should make it so you are accepting a new step on your random walk 23.4% of the time. This supposedly is a good balance between exploration (finding better places elsewhere) and exploitation (staying in a good place) in many situations.

This isn’t fully settled though; it’s only true in some situations, and counter-examples have come up in later research. (https://www.sciencedirect.com/science/article/pii/S0304414907002177)

I experimented with having a “burn in” phase where it also adjusted the sigma to try and reach that 23.4% acceptance rate over the previous N samples. I didn’t play with it very long though, and was unable to get it to reliably reach that acceptance rate.
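Here’s one simple heuristic for that kind of adjustment (an assumption of mine, not what the code in the repo does): grow sigma when you’re accepting too often, shrink it when you’re accepting too rarely.

```cpp
#include <cmath>

// Hypothetical tuning heuristic: nudge sigma toward a 0.234 acceptance rate.
// accepted / total is the acceptance rate over the last N samples.
double TuneSigma(double sigma, size_t accepted, size_t total)
{
    double acceptanceRate = double(accepted) / double(total);

    // accepting too often suggests steps are too small: grow sigma.
    // accepting too rarely suggests steps are too big: shrink sigma.
    return sigma * std::exp(acceptanceRate - 0.234);
}
```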

Even beyond the 0.234 acceptance rate goal, tuning the step size for your specific situation can noticeably change the quality of your results. I didn’t play around with that much in my experimentation though, and found that a sigma of 0.2 worked pretty well for functions over the 0 to pi and 0 to 2 pi ranges.

The initial starting point of your random walk can obviously affect quality too, which is exactly what the burn in stage is meant to address: finding a better starting position.

Limiting the Function’s Range

In one of the tests I did, I used the function y=sin(x) with x going from 0 to pi/2.

At first I tried clamping x to [0, pi/2] to keep it in range, but doing that made the technique fall apart. There were plenty of times the random walk would try to step out of bounds, and instead of taking that step, it would clamp to the border. This meant the random walk had a significantly higher chance of landing exactly on the boundary than anywhere else on the graph, and that broke the algorithm.

The correct thing to do was to just make the function return 0 when x was out of range. In this way, it would end up taking any out of range location with 0% probability, but there was no bias about more or less likely places visited by the random walk, other than it preferring higher (more probable) parts of the function.
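As a sketch of that (the repo may structure this differently):

```cpp
#include <cmath>

// y = sin(x), limited to x in [0, pi/2].
// Out of range x returns 0 probability instead of clamping x, so the
// random walk can propose out of range steps but never accepts them.
double F(double x)
{
    const double c_piOver2 = 1.57079632679489662;
    if (x < 0.0 || x > c_piOver2)
        return 0.0;
    return std::sin(x);
}
```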

Something else worth noting about the function is that if it ever returns a negative number, that ends up being the same as if it returned zero probability: the acceptance value A becomes negative, and a random number between 0 and 1 can never be below it, so the candidate is never accepted.

If you use Metropolis MCMC for integration, this can be an important fact, because it will effectively ignore the negative values and treat them as zero.

Discrete Case

As described, this algorithm works with continuous random numbers, drawing them from a PDF.

The same concepts work for discrete states though too (a more traditional looking markov chain), drawing from a PMF instead.

When handling the discrete case, it needs to be possible to be in any state at any point in time; the chain must not be periodic. A usual way to avoid the edge case where the probabilities work out such that you can only be in some nodes on even steps and others on odd steps is to have a “self loop” on at least one node: a greater than zero probability of staying at the same node.

This page has some great info about the discrete case:
Markov Chain Monte Carlo Without all the Bullshit

Integration

Using the Metropolis algorithm for numerical integration is possible, but is not as straightforward as Monte Carlo integration.

In Monte Carlo integration, to get a single estimate of an integral you calculate f(x) / pdf(x). f(x) is the function value at the random location x, and pdf(x) is the probability of that x value being chosen. You take the average of N such estimates and as N approaches infinity, the error of the average of estimates approaches 0.
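As a quick sketch of that (using uniform random x, so pdf(x) = 1/(b-a)):

```cpp
#include <random>

// Plain Monte Carlo integration of f over [a, b] using uniform random x.
// Each estimate is f(x) / pdf(x), where pdf(x) = 1 / (b - a) for uniform x.
template <typename TFunc>
double MonteCarloIntegrate(const TFunc& f, double a, double b, size_t sampleCount)
{
    std::mt19937 rng{ std::random_device{}() };
    std::uniform_real_distribution<double> dist(a, b);

    double sum = 0.0;
    for (size_t i = 0; i < sampleCount; ++i)
        sum += f(dist(rng)) * (b - a);  // f(x) / (1 / (b - a))
    return sum / double(sampleCount);
}
```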

In Metropolis MCMC we do have N samples, and it almost seems like we have enough data to do the same thing, but it turns out that we don’t.

For f(x) which is the function value at the random location, you literally do have f(x). It’s the y component of each sample generated.

The problem is that we don’t have pdf(x).

The probability of choosing x is in fact based on the function we are evaluating, f(x). The function is essentially an un-normalized pdf, and the Metropolis algorithm lets us draw random numbers from that pdf without ever knowing the normalization constant.

So, pdf(x) is some scalar multiple of f(x), but we have no idea what the multiplier is. That multiplier, the normalization constant, turns out to be exactly the integral we’re trying to find.

So we’ve gone in a circle and are no closer to being able to integrate with Metropolis.

There are some ways to deal with this though.

One way is to do mathematical tricks to make it so things “cancel out” and leave the normalization constant.

There is something called the “Harmonic Mean Estimator” that does this, but it has infinite variance, so it’s been called “The Worst Monte Carlo Method Ever”.
https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/

There is another way though, that I use in the code that goes with this post.

Imagine that while you are doing your Metropolis MCMC, you have some interval [a,b], and whenever a sample falls in that range, you increment a counter.

After N total samples, you’ll have M samples that fell in this interval. An estimate of the integral of the normalized pdf over this interval is M/N.

Now, you can do regular Monte Carlo integration of the function over this range to get the integral of the UN-normalized pdf over this interval.

When you divide the un-normalized integral by the normalized integral, you get the normalization constant, aka the integral of the function over its whole domain.

A smaller interval size is better for the Monte Carlo integration because it will converge faster (better results in fewer samples), but it’s worse for the Metropolis integration because a smaller interval is less likely to be accurate with the random walk.

My intuition tells me that if you keep a histogram of the x values generated by the Metropolis algorithm, the buckets with higher counts are more likely to be accurate. So I just integrate over whatever histogram bucket has the highest count. I haven’t done any real analysis of whether or not this is true, or how good this integration estimate is in general.
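Here’s a condensed sketch of the whole estimate (my own version, differing from the repo’s code in details). It assumes the MonteCarloIntegrate() sketch from earlier and a vector of Metropolis samples:

```cpp
#include <vector>

// Estimate the integral of f over its whole domain. [a, b] is a small
// interval, e.g. the bounds of the histogram bucket with the highest count.
template <typename TFunc>
double EstimateIntegral(const TFunc& f, const std::vector<double>& metropolisSamples, double a, double b)
{
    // M / N is an estimate of the integral of the NORMALIZED pdf over [a, b]
    size_t M = 0;
    for (double x : metropolisSamples)
        if (x >= a && x <= b)
            ++M;
    double normalizedIntegral = double(M) / double(metropolisSamples.size());

    // Monte Carlo estimate of the integral of f (the UN-normalized pdf) over [a, b]
    double unnormalizedIntegral = MonteCarloIntegrate(f, a, b, 100000);

    // their ratio is the normalization constant, aka the integral of f
    return unnormalizedIntegral / normalizedIntegral;
}
```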

For the Monte Carlo integration, I used white noise (regular old random numbers) to integrate, but in reality you’d get much better results from something like Sobol. I used white noise because I made the code generic for N dimensions, and white noise generalizes to any dimension.

Experiment Results

I didn’t play around much with initial guesses, sigmas, trying to reach the 0.234 acceptance rate, or burn in, but here are some results from the code that goes with the post.

In the below, the blue line – normalized function value – is the actual desired pdf. The red line – percentage – is how many samples we actually got in each histogram bucket, as a percentage. When these lines match up, we are happy and everything worked like we wanted it to.

y=sin(x) x in [0,pi]

y=|sin(x)| x in [0, 2pi]

y=sin(x)*sin(x) x in [0, 2pi]

It’s interesting to see the last one be so far from the real PDF. That function must trap the random walk in one side or the other a bit too much.

There is also a 2d function z=f(x,y) that is tested in the C++ code that goes with this post. I don’t know of any easy ways to make a 2d histogram, so I don’t have any results to show.

Links and Closing

The Metropolis algorithm is pretty neat, but it’s just the beginning of MCMC methods. I’ve heard that Hamiltonian Monte Carlo can give much better results by using derivatives to take more intelligently sized steps.

Something I find interesting is that plain Monte Carlo uses white noise, quasi Monte Carlo uses low discrepancy sequences (and I think blue noise would fit in here), while Metropolis MCMC uses a random walk, which is red noise.

I’m not sure what to make of that, but my brief reading about Hamiltonian Monte Carlo suggested that it allows the samples to be less dependent, and that’s why it improves things. Maybe there are some secrets to red noise, like there are for blue noise? I’m not really sure, but will keep looking :P

A great write up on Metropolis MCMC
https://stephens999.github.io/fiveMinuteStats/MH_intro.html

Another small but useful write up
http://www.pmean.com/07/MetropolisAlgorithm.html

A 35 minute video about Metropolis MCMC

A mathier set of videos about Metropolis MCMC that is actually very easy to understand:

“A Zero-Math Introduction to Markov Chain Monte Carlo Methods”
https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50

A more mathy overview of Metropolis
https://ermongroup.github.io/cs323-notes/probabilistic/mh/

A series of posts aimed at being a gentle introduction to MCMC
https://theclevermachine.wordpress.com/tag/monte-carlo-integration/

A multi-branch twitter thread talking about some interesting MCMC related things:

“Introduction to MCMC”

“MCMC Burn In”

Markov Chain Text Generation

This post includes a standalone (only standard headers, no external libs) ~400 line C++ source file that can analyze text and use an order N Markov chain to randomly generate new text in the same style. The Markov code itself is fairly generic / re-usable and a template parameter to the class lets you specify the order of the chain as well as the type of state data to use. That code is on github at: https://github.com/Atrix256/TextMarkovChain

When I see material on Markov chains, it usually comes in two flavors:

  1. Very Mathy
  2. Pretty impressive results light on explanation

It turns out the reason for this is that they CAN be very mathy, but they can also be extremely simple.

Without knowing this, I decided it was time to learn about Markov chains. I leveled up my linear algebra knowledge a bit, finally getting a solid grasp on eigenvectors, and learning things like how to put a matrix into eigenbasis form to make matrix exponentiation a trivial operation. There are links at the bottom of the post if you want to learn this stuff too.

Then, I sat down to learn Markov chains and nearly flipped my table over! Yes, Markov chains can be mathy (and matrix exponentiation is one way to find a Markov chain steady state, but not the best), but that stuff isn’t really required for most uses.

Markov Chains

A Markov chain is just any situation where you have some number of states, and each state has percentage chances to change to 0 or more other states.

You can get these percentages by looking at actual data, and then you can use these probabilities to GENERATE data of similar types / styles.

Example

This post uses Markov chains to generate text in the style of provided source text.

The first step is to analyze the source text.

To do that, it goes through the text and, for each word it finds, keeps track of what words came next, and how many times those words came next.

When analyzing the story “The Tell-Tale Heart” by Edgar Allan Poe for instance (https://poestories.com/read/telltaleheart , also is data/telltale.txt in the code that goes with this post), here are the words that came after “when” and their counts.

  • all – 1
  • enveloped – 2
  • he – 1
  • i – 4
  • my – 2
  • overcharged – 1
  • the – 1

Here are the counts for the words that appear after “is”:

  • but – 1
  • impossible – 1
  • merely – 1
  • nothing – 1
  • only – 1
  • the – 2

After all these counts have been gathered up, the next step is to convert them into probabilities. You do this by summing up the counts of the words that come after a specific word, and dividing each of those counts by the total sum.

The above examples then turn from counts to probabilities. Here is “when”:

  • all – 8%
  • enveloped – 16%
  • he – 8%
  • i – 33%
  • my – 16%
  • overcharged – 8%
  • the – 8%

Here is “is”:

  • but – 14%
  • impossible – 14%
  • merely – 14%
  • nothing – 14%
  • only – 14%
  • the – 28%

Note: The code that goes with this post spits out these counts and percentages in the “out/stats.txt” file if you ever want to see the data.
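A sketch of that conversion (illustrative types and names, not necessarily what the repo uses):

```cpp
#include <map>
#include <string>

// counts[word][nextWord] = how many times nextWord followed word
using WordCounts = std::map<std::string, std::map<std::string, size_t>>;
// probabilities[word][nextWord] = chance of nextWord following word
using WordProbabilities = std::map<std::string, std::map<std::string, double>>;

WordProbabilities CountsToProbabilities(const WordCounts& counts)
{
    WordProbabilities result;
    for (const auto& [word, nextWords] : counts)
    {
        // sum the counts of all the words that came after this word
        size_t total = 0;
        for (const auto& [nextWord, count] : nextWords)
            total += count;

        // divide each count by the total to get a probability
        for (const auto& [nextWord, count] : nextWords)
            result[word][nextWord] = double(count) / double(total);
    }
    return result;
}
```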

Once the probabilities are known, you can start generating text. The first thing you do is pick a word purely at random; this is the first word in the text.

Next, you use the probabilities of what words come after that word to randomly choose the next word.

You then use the probabilities of what words come after that word to randomly choose the next word.

This repeats until you’ve generated as much text as you want.

The code with this post generates 1000 words into the “out/generated.txt” file.
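Choosing each next word is just a weighted random roll against those probabilities. A sketch (again with illustrative names):

```cpp
#include <map>
#include <random>
#include <string>

// Roll a number in [0,1) and walk the cumulative probabilities of the
// words that can come next, returning the word the roll landed on.
std::string ChooseNextWord(const std::map<std::string, double>& nextWordProbs, std::mt19937& rng)
{
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double roll = dist(rng);

    double cumulative = 0.0;
    for (const auto& [word, probability] : nextWordProbs)
    {
        cumulative += probability;
        if (roll < cumulative)
            return word;
    }
    // guard against floating point drift leaving the roll unmatched
    return nextWordProbs.rbegin()->first;
}
```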

That is literally all there is to it. You could do this same process with sheet music to generate more music in the same style, you could do it with weather forecasts to generate realistic weather forecasts (or even try to use it to predict what weather is next). You can do this with any data you can imagine.

Example Generated Output

Here is 100 words of generated text from various sources.

First is text generated from “The Tell-Tale Heart” by Edgar Allan Poe (https://poestories.com/read/telltaleheart):

…About trifles, and with perfect distinctness — very slowly, my sagacity. I then took me, louder — you cannot imagine how stealthily — with what caution — cautiously — would have told you may think that no longer i knew that no blood – spot. He would not even his room, to do the hour had made up my whole week before him. I knew what dissimulation i showed them causeless, undisturbed. Now a hideous heart, no — wide open — all and the old man, and he would have…

Here is text generated from “The Last Question” by Isaac Asimov (http://hell.pl/szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm):

…Glory that. Man said, it into a meaningful answer. Granted, said, might be kept from the entire known to restore the universe for meaningful answer. Mq – talkie robot, ac learned how many stars are dying. The boys appreciated that not. Cosmic ac that, how may be able to reach the small station, said at half the same. He shrugged. We’ll have enough to be alone. And lose itself aloof. When any other kind of universal ac. He consisted of individuals were self – contact…

Here is text generated from a research paper “Projective Blue-Noise Sampling” (http://resources.mpi-inf.mpg.de/ProjectiveBlueNoise/ProjectiveBlueNoise.pdf):

…Numerical integration. Mj patterns to vector multiplication to achieve a way that the above question whether there exist distributions have addressed anisotropic classic lloyd relaxation green and rotated pattern significantly worse than the j 1, where each site: our projective blue – noise point distributions along both axes. Previous work sampling when undergoing one after a certain number of common blue noise patterns, but at the publisher s ., cohen – left constructs a quality of latinizing the non – sample counts however, as a set only in a theory 28, this shrinkage…

Here is text generated from an example (not real, but representative) psych report from my wife who is a school psychologist:

…Brother had to mildly impaired body movement, the school and placement after a 90 probability that student: adapting to struggle as video games. Student’s planning and he request, spelling subtest scores. This time. The student: this time and accurately with both, including morphology, 2013. Administrators should consider participation in the following are student as intellectually disabled specific auditory comprehension of reading: mr. Mrs. The two subtest is designed to use of or economic disadvantages, gestures, vitality or economic disadvantages, picking at approximately 5th grade prior…

Here we generate a Markov chain using ALL the above source texts, to get a mash up of all of them.

…Restore the sphere packing radius is likely an adaptive skills. Please see inset in the conner s problems, we’ll just have well and visualization and he is computed on 1 2 was contacted by things, and restricted number of his abilities. We can simply like them, as well as a s difficulty interacting with a closer to cry, the process based on the standards – appropriate to spurious aliasing artefacts mit87, making a meaningful answer. Finally, 11 months through hyperspace to try his eye contact. Jerrodine’s eyes were going out if…

Lastly, here is only Poe and Asimov combined:

…Could not forever, and continually increased. And stood for a sudden springing to get back and the eighth night i to that man, 2061, but the original star and made trips. A very, and fell full youthfulness even to feel — i then stop someday in five words on a while i heard all the noise steadily for us, calling him to pluto and now a galaxy alone pours out, quick sound would think of individuals. He stirred his hideous veil over the ceiling. Twenty billion years ago, man, …

Nth Order Markov Chains

Using one word to generate the next word works somewhat well – the generated Poe text definitely seemed like Poe for instance – but there are plenty of times when things don’t make much sense.

A Markov chain becomes higher order when you don’t just look at the current state to transition to the next state, but at the last N states.

In the text generation case, it means that a 2nd order Markov chain would look at the previous 2 words to pick the next word. An order 3 Markov chain would look at the previous 3 words to pick the next word.

Interestingly, an order 0 Markov chain looks at NO WORDS to generate the next word, so is purely random word generation, with similar word counts (by percentage) as the original text.

The code that goes along with this post lets you specify the order of the Markov chain.
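One way to represent the higher order state (an assumption of mine; the repo keys states via a template parameter and may do this differently) is to join the last N words into a single state key:

```cpp
#include <deque>
#include <string>

// Join the last N words into one state key, e.g. for order 2:
// {"the", "old"} -> "the old", which then maps to next-word counts.
std::string MakeStateKey(const std::deque<std::string>& lastWords)
{
    std::string key;
    for (size_t i = 0; i < lastWords.size(); ++i)
    {
        if (i > 0)
            key += ' ';
        key += lastWords[i];
    }
    return key;
}
```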

Here is “The Tell-Tale Heart” with an order two Markov chain.

…Dark as midnight. As the bell sounded the hour, there came to my ears: but he had been too wary for that. A tub had caught all — ha ha when i describe the wise precautions i took for the concealment of the old man sprang up in bed, crying out — no blood – spot whatever. I removed the bed and examined the corpse. Yes, he was stone, stone dead. I knew that he had been lodged at the police. A watch’s minute hand moves more quickly than did…

If you compare that to the actual story, you can find fairly large sections that are taken verbatim from the source text, but the arrangement of those larger chunks is different.

The reason for this is that when you key off of two words instead of one, the number of distinct states goes up, which means that on average there are fewer choices for the “next word”. That makes the results less random and more deterministic.

If you gave it more text (like, maybe, all of Edgar Allan Poe’s work), there would be more options for the next word after specific 2 word pairs, but with a single short story, it doesn’t have very many choices. If you look at the out/stats.txt file and compare order 1 vs order 2, you can see that order 2 has a lot more situations where a current state maps to a single next state.

At order 3 there are even fewer choices, and it hits a pattern loop:

…Had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to fear there entered three men, who introduced themselves, with perfect suavity, as officers of the police. A shriek had been heard by a neighbor during the night; suspicion of foul play had been aroused; information had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to…

Here is an order 2 mashup of Poe and Asimov:

…Crossing the floor, and still chatted. The universal ac interrupted zee prime’s own. It had to be contrary, and jerrodette i. Ask multivac. As the passage through hyperspace was completed in its place, each cared for by perfect automatons, equally incorruptible, each with its dreadful echo, the real essence of men was to be contrary now, now, honeys. I’ll ask microvac. Don’t shout. When the sun, and their only concern at the visiplate change as the frightened technicians felt they could hold their breath no…

Lastly, here’s an order 2 mashup of all 4 source texts:

…Mathematics: student does not require special education and related services, the radius of each other, indistinguishable. Man said, ac organized the program. The purpose of this report provides information about the child s educational performance. Other pertinent future work includes the extension of our projective lloyd patterns against other patterns on a role not based on his scores on this scale is different for the sake of visual clarity, we specify all spaces via a set x. In a way, man, i undid it just so much that a single…

Other Implementation Details

When combining the texts, it might make sense to “normalize” the percentages for each source text. How it works now with raw counts makes it so longer documents have more of their style preserved in the final output document.

You may also want to give weightings to different text so you can have a sliding scale between Poe and Asimov for instance, by basically scaling the counts from their files higher or lower to give more or less representation in the results.

When analyzing the text, I had to think about what to do with punctuation. I chose to treat punctuation as words in themselves, but ignored some punctuation that was giving weird results – like double quotes. I’ve only just now realized that I incorrectly ignore question marks. Oops.

When generating text, I made it so some words don’t put a space before themselves (like a period!), and I also made it so words have their first letter capitalized after a period or similar. Getting reasonable results seems to need this kind of ad hoc, domain specific massaging.

It’s possible (especially with higher order Markov chains) that you can get into a situation where your current state has nothing to transition to. You’d have to figure out what to do in this case. One idea would be to choose a next word at random. Another idea would be to fall back to a lower order Markov chain.

I feel like once you understand the algorithm, it’s an art form to teach and tune the Markov chain to get good results. I bet there are some interesting techniques beyond the simple things I’ve done here.

Links

Mathy Markov Chain Info

If you want to dive into the mathy side of Markov chains, here are some great resources you can follow to get there…

A great linear algebra online “text book”, that is very easy to read and understand: http://immersivemath.com/ila/index.html

Some great videos on linear algebra: https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

A 9 part series on Markov chains. It’s this long because it’s very explicit and works through the details by hand. I watched it at like 1.5x speed and was fine :P

Some “mathy” notes about Markov chains, including higher order ones:
chap4.pdf

Q Learning

Related to Markov chains, Q learning is essentially a way to learn a Markov chain from data – for instance, learning how to play tic tac toe, or how to traverse a maze.

I would like to learn Q learning better and make a post (and code!) at some point.

Q Learning Explained With HTML5
https://blockulator.github.io/Q-Learning-Explained-With-HTML5/

An introduction to Q-Learning: reinforcement learning
https://medium.freecodecamp.org/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

Reinforcement Learning Tutorial Part 1: Q-Learning
https://blog.valohai.com/reinforcement-learning-tutorial-part-1-q-learning

Reinforcement Learning Tutorial Part 2: Cloud Q-learning
https://blog.valohai.com/reinforcement-learning-tutorial-cloud-q-learning

Reinforcement Learning Tutorial Part 3: Basic Deep Q-Learning
https://towardsdatascience.com/reinforcement-learning-tutorial-part-3-basic-deep-q-learning-186164c3bf4

Other

Here is a twitter conversation about some compelling uses of Markov chains

Here’s a video “Markov Chain Monte Carlo and the Metropolis Algorithm” which uses Markov chains to help calculate integrals numerically.

Code

Again, the code for this post is up on github at https://github.com/Atrix256/TextMarkovChain

The code is written for readability and runs plenty fast for this demo (nearly instant in release, a couple seconds in debug), but there are lots of string copies etc. that you would want to fix up if using this code seriously.

Thanks for reading!