In statistics and statistical physics, the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution from which direct sampling is difficult. New samples are added to the sequence in two steps: first a new sample is proposed based on the previous sample, then the proposed sample is either added to the sequence or rejected depending on the value of the probability distribution at that point. The resulting sequence can be used to approximate the distribution (e.g. to generate a histogram) or to compute an integral (e.g. an expected value).
Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, there are usually other methods (e.g. adaptive rejection sampling) that can directly return independent samples from the distribution, and these are free from the problem of autocorrelated samples that is inherent in MCMC methods.
The algorithm is named in part for Nicholas Metropolis, the first coauthor of a 1953 paper, entitled Equation of State Calculations by Fast Computing Machines, with Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller and Edward Teller. For many years the algorithm was known simply as the Metropolis algorithm.[1][2] The paper proposed the algorithm for the case of symmetrical proposal distributions, but in 1970, W. K. Hastings extended it to the more general case. The generalized method was eventually identified by both names, although the first use of the term "Metropolis–Hastings algorithm" is unclear.
Some controversy exists with regard to credit for development of the Metropolis algorithm. Metropolis, who was familiar with the computational aspects of the method, had coined the term "Monte Carlo" in an earlier article with Stanisław Ulam, and led the group in the Theoretical Division that designed and built the MANIAC I computer used in the experiments in 1952. However, prior to 2003 there was no detailed account of the algorithm's development. Shortly before his death, Marshall Rosenbluth attended a 2003 conference at LANL marking the 50th anniversary of the 1953 publication. At this conference, Rosenbluth described the algorithm and its development in a presentation titled "Genesis of the Monte Carlo Algorithm for Statistical Mechanics". Further historical clarification is made by Gubernatis in a 2005 journal article recounting the 50th anniversary conference. Rosenbluth makes it clear that he and his wife Arianna did the work, and that Metropolis played no role in the development other than providing computer time.
This contradicts an account by Edward Teller, who states in his memoirs that the five authors of the 1953 article worked together for "days (and nights)". In contrast, the detailed account by Rosenbluth credits Teller with a crucial but early suggestion to "take advantage of statistical mechanics and take ensemble averages instead of following detailed kinematics". This, says Rosenbluth, started him thinking about the generalized Monte Carlo approach – a topic which he says he had discussed often with John von Neumann. Arianna Rosenbluth recounted (to Gubernatis in 2003) that Augusta Teller started the computer work, but that Arianna herself took it over and wrote the code from scratch. In an oral history recorded shortly before his death, Rosenbluth again credits Teller with posing the original problem, himself with solving it, and Arianna with programming the computer.
The Metropolis–Hastings algorithm can draw samples from any probability distribution with probability density P(x), provided that we know a function f(x) proportional to the density P and the values of f(x) can be calculated. The requirement that f(x) need only be proportional to the density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful, because it removes the need to calculate the density's normalization factor, which is often extremely difficult in practice.
The Metropolis–Hastings algorithm generates a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution. These sample values are produced iteratively, with the distribution of the next sample depending only on the current sample value; this makes the sequence of samples a Markov chain. Specifically, at each iteration the algorithm proposes a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted (in which case the candidate value is used in the next iteration) or rejected (in which case the candidate value is discarded and the current value is reused in the next iteration). The probability of acceptance is determined by comparing the values of the function f(x) of the current and candidate sample values with respect to the desired distribution.
The method used to propose new candidates is characterized by the probability distribution g(x | y) (sometimes written Q(x | y)) of a new proposed sample x given the previous sample y. This is called the proposal density, proposal function, or jumping distribution. A common choice for g(x | y) is a Gaussian distribution centered at y, so that points closer to y are more likely to be visited next, making the sequence of samples a random walk. In the original paper by Metropolis et al. (1953), g(x | y) was suggested to be a uniform distribution limited to some maximum distance from y.
For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below.
Let f(x) be a function that is proportional to the desired probability density function P(x) (also called the target distribution).

1. Initialization: Choose an arbitrary point x_t to be the first observation in the sample, and choose a proposal function g(x | y). In this section, g is assumed to be symmetric; in other words, it must satisfy g(x | y) = g(y | x).
2. For each iteration t:
   - Propose a candidate x' for the next sample by picking from the distribution g(x' | x_t).
   - Calculate the acceptance ratio α = f(x')/f(x_t), which will be used to decide whether to accept or reject the candidate. Because f is proportional to the density of P, we have that α = f(x')/f(x_t) = P(x')/P(x_t).
   - Accept or reject: Generate a uniform random number u ∈ [0, 1]. If u ≤ α, accept the candidate by setting x_{t+1} = x'. If u > α, reject the candidate and set x_{t+1} = x_t instead.
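As an illustration, the following Python sketch implements the steps above for a one-dimensional target. The unnormalized density f (a standard normal without its normalization constant) and the proposal scale are illustrative assumptions, not part of the algorithm itself.

```python
# A minimal sketch of the Metropolis algorithm (symmetric proposal).
# The target f and the proposal scale below are illustrative choices.
import numpy as np

def f(x):
    """Unnormalized target density: f(x) proportional to P(x)."""
    return np.exp(-0.5 * x**2)

def metropolis(f, x0, n_samples, proposal_scale=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        # Propose a candidate from a symmetric Gaussian centered at x.
        x_new = x + proposal_scale * rng.normal()
        # Acceptance ratio alpha = f(x') / f(x_t) = P(x') / P(x_t).
        alpha = f(x_new) / f(x)
        # Accept with probability min(1, alpha); otherwise keep x.
        if rng.uniform() <= alpha:
            x = x_new
        samples[t] = x
    return samples

samples = metropolis(f, x0=0.0, n_samples=10_000)
print(samples.mean(), samples.std())  # should be near 0 and 1
```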
This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio α indicates how probable the new proposed sample is with respect to the current sample, according to the distribution whose density is P(x). If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of P(x), corresponding to α > 1 ≥ u), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the larger the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of P(x), while only occasionally visiting low-density regions. Intuitively, this is why the algorithm works and returns samples that follow the desired distribution with density P(x).
Compared with an algorithm like adaptive rejection sampling[3] that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages:

- The samples are autocorrelated. Even though over the long term they do correctly follow P(x), a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that effective sample sizes can be significantly lower than the number of samples actually taken, leading to large errors.
- Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary, where an initial number of samples are thrown away.
On the other hand, most simple rejection sampling methods suffer from the "curse of dimensionality", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings and other MCMC methods do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from hierarchical Bayesian models and other high-dimensional statistical models now used in many disciplines.
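The exponential growth of the rejection rate can be seen in a small simulation; the uniform box proposal and Gaussian target below are illustrative assumptions, not a practical sampler.

```python
# A small sketch illustrating the curse of dimensionality for plain
# rejection sampling: drawing a d-dimensional standard normal by
# proposing uniformly on the box [-5, 5]^d. The acceptance rate decays
# roughly like 0.25 ** d.
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 5, 10):
    n = 100_000
    x = rng.uniform(-5, 5, size=(n, d))        # uniform proposals
    p = np.exp(-0.5 * np.sum(x**2, axis=1))    # unnormalized N(0, I), max 1
    accept = rng.uniform(size=n) <= p          # envelope constant M = 10**d
    print(d, accept.mean())                    # shrinks exponentially in d
```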
In multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding a suitable jumping distribution can be difficult, as the different individual dimensions behave in very different ways and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as Gibbs sampling, involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. That way, the problem of sampling from a potentially high-dimensional space is reduced to a collection of low-dimensional sampling problems.[5] This is especially applicable when the multivariate distribution is composed of a set of individual random variables in which each variable is conditioned on only a small number of other variables, as is the case in most typical hierarchical models. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are the adaptive rejection sampling methods, the adaptive rejection Metropolis sampling algorithm,[6] a simple one-dimensional Metropolis–Hastings step, or slice sampling.
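For illustration, here is a minimal Python sketch of Gibbs sampling for a bivariate normal target with correlation rho, a hypothetical example chosen because both full conditionals are known in closed form.

```python
# A minimal sketch of Gibbs sampling: each coordinate is resampled in
# turn from its conditional distribution given the most recent value of
# the other. For a standard bivariate normal with correlation rho,
# x | y ~ N(rho * y, 1 - rho**2) and symmetrically for y | x.
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x, y = 0.0, 0.0
    cond_sd = np.sqrt(1.0 - rho**2)   # sd of each conditional
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x = rng.normal(rho * y, cond_sd)  # draw x given current y
        y = rng.normal(rho * x, cond_sd)  # draw y given the new x
        samples[t] = (x, y)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=10_000)
print(np.corrcoef(samples.T)[0, 1])  # should be near 0.8
```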
The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a desired distribution P(x). To accomplish this, the algorithm uses a Markov process, which asymptotically reaches a unique stationary distribution π(x) such that π(x) = P(x).
A Markov process is uniquely defined by its transition probabilities P(x' | x), the probability of transitioning from any given state x to any other given state x'. It has a unique stationary distribution π(x) when the following two conditions are met:

1. Existence of a stationary distribution: there must exist a stationary distribution π(x). A sufficient (but not necessary) condition is detailed balance, which requires that each transition x → x' is reversible: for every pair of states x, x', the probability of being in state x and transitioning to state x' must equal the probability of being in state x' and transitioning to state x, that is,

π(x) P(x' | x) = π(x') P(x | x').

2. Uniqueness of the stationary distribution: the stationary distribution π(x) must be unique. This is guaranteed by ergodicity of the Markov process, which requires that every state be aperiodic (the system does not return to the same state at fixed intervals) and positive recurrent (the expected number of steps for returning to the same state is finite).
The Metropolis–Hastings algorithm involves designing a Markov process (by constructing transition probabilities) that fulfills the two above conditions, such that its stationary distribution π(x) is chosen to be P(x). The derivation of the algorithm starts from the condition of detailed balance:

P(x' | x) P(x) = P(x | x') P(x'),

which is re-written as

P(x' | x) / P(x | x') = P(x') / P(x).
The approach is to separate the transition into two sub-steps: the proposal and the acceptance-rejection. The proposal distribution g(x' | x) is the conditional probability of proposing a state x' given the current state x, and the acceptance distribution A(x', x) is the probability of accepting the proposed state x'. The transition probability can be written as the product of the two:

P(x' | x) = g(x' | x) A(x', x).
Inserting this relation into the previous equation, we have

A(x', x) / A(x, x') = [P(x') / P(x)] · [g(x | x') / g(x' | x)].
The next step in the derivation is to choose an acceptance ratio that fulfills the condition above. One common choice is the Metropolis choice:

A(x', x) = min(1, [P(x') / P(x)] · [g(x | x') / g(x' | x)]).

For this Metropolis acceptance ratio A, either A(x', x) = 1 or A(x, x') = 1, and either way the condition is satisfied.
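A quick numeric check on a hypothetical three-state chain confirms that the Metropolis choice produces transition probabilities satisfying detailed balance and leaving P stationary; the target P and proposal matrix g below are illustrative assumptions.

```python
# Numeric check (hypothetical 3-state example) that the Metropolis choice
# of A(x', x) yields a transition matrix satisfying detailed balance.
import numpy as np

P = np.array([0.2, 0.3, 0.5])           # target distribution
g = np.array([[0.0, 0.5, 0.5],           # proposal g(x'|x): row x, column x'
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

n = len(P)
T = np.zeros((n, n))                     # transition matrix P(x'|x)
for x in range(n):
    for x_new in range(n):
        if x_new != x:
            A = min(1.0, (P[x_new] / P[x]) * (g[x_new, x] / g[x, x_new]))
            T[x, x_new] = g[x, x_new] * A
    T[x, x] = 1.0 - T[x].sum()           # rejected mass stays at x

# Detailed balance: P(x) T(x'|x) == P(x') T(x|x') for all pairs.
M = P[:, None] * T
print(np.allclose(M, M.T))               # True
# Stationarity: P is a left eigenvector of T with eigenvalue 1.
print(np.allclose(P @ T, P))             # True
```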
The Metropolis–Hastings algorithm can thus be written as follows:

1. Initialise: pick an initial state x_0 and set t = 0.
2. Iterate:
   - Generate a random candidate state x' according to g(x' | x_t).
   - Calculate the acceptance probability A(x', x_t) = min(1, [P(x') / P(x_t)] · [g(x_t | x') / g(x' | x_t)]).
   - Accept or reject: generate a uniform random number u ∈ [0, 1]; if u ≤ A(x', x_t), accept the new state and set x_{t+1} = x'; if u > A(x', x_t), reject the new state and copy the old state forward, x_{t+1} = x_t.
   - Increment: set t = t + 1.
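A minimal Python sketch of these steps follows. The target (an unnormalized Gamma density) and the asymmetric log-normal random-walk proposal are illustrative assumptions, chosen so that the Hastings correction factor g(x_t | x')/g(x' | x_t) is not trivially 1.

```python
# A minimal sketch of general Metropolis-Hastings with the Hastings
# correction for an asymmetric proposal.
import numpy as np

def p(x):
    """Unnormalized target density P(x): Gamma with shape 3, rate 1."""
    return x**2 * np.exp(-x) if x > 0 else 0.0

def g_density(x_new, x, s=0.5):
    """Proposal density g(x_new | x): log-normal centered at ln(x)."""
    return np.exp(-(np.log(x_new) - np.log(x))**2 / (2 * s**2)) \
        / (x_new * s * np.sqrt(2 * np.pi))

def metropolis_hastings(p, x0, n_samples, s=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        x_new = np.exp(np.log(x) + s * rng.normal())  # draw from g(. | x)
        # A(x', x) = min(1, [P(x')/P(x)] * [g(x|x') / g(x'|x)])
        a = (p(x_new) / p(x)) * (g_density(x, x_new, s) / g_density(x_new, x, s))
        if rng.uniform() <= a:   # min(1, a) is implicit in this comparison
            x = x_new
        samples[t] = x
    return samples

samples = metropolis_hastings(p, x0=1.0, n_samples=20_000)
print(samples.mean())  # Gamma(3, 1) has mean 3
```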
Provided that specified conditions are met, the empirical distribution of saved states x_0, …, x_T will approach P(x). The number of iterations (T) required to effectively estimate P(x) depends on several factors, including the relationship between P(x) and the proposal distribution and the desired accuracy of estimation.
It is important to notice that, in a general problem, it is not clear which distribution g(x' | x) one should use or how many iterations are necessary for proper estimation; both are free parameters of the method, which must be adjusted to the particular problem at hand.
See main article: Monte Carlo integration.
A common use of the Metropolis–Hastings algorithm is to compute an integral. Specifically, consider a space Ω ⊂ ℝ and a probability distribution P(x) over Ω, x ∈ Ω. Metropolis–Hastings can estimate an integral of the form

P(E) = ∫_Ω A(x) P(x) dx,

where A(x) is a (measurable) function of interest.

For example, consider a statistic E(x) and its probability distribution P(E), which is a marginal distribution. Suppose that the goal is to estimate P(E) for E on the tail of P(E). Formally, P(E) can be written as

P(E) = ∫_Ω P(E | x) P(x) dx = ∫_Ω δ(E − E(x)) P(x) dx = E(P(E | X)),

and, thus, estimating P(E) can be accomplished by estimating the expected value of the indicator function A_E(x) ≡ 1_E(x), which is one when E(x) ∈ [E, E + ΔE] and zero otherwise. Because E is on the tail of P(E), the probability of drawing a state x with E(x) on the tail of P(E) is proportional to P(E), which is small by definition. The Metropolis–Hastings algorithm can be used here to sample (rare) states more likely and thus increase the number of samples used to estimate P(E) on the tails. This can be done e.g. by using a sampling distribution π(x) that favors those states (e.g. π(x) ∝ e^{aE} with a > 0).
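As a sketch of this use, the following Python fragment estimates E_P[A(x)] = ∫ A(x) P(x) dx from Metropolis samples. The standard-normal target and the test function A(x) = x², whose true expectation is 1, are illustrative assumptions.

```python
# A minimal sketch of estimating an integral E_P[A(x)] with Metropolis
# samples: average A over the chain after discarding burn-in.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.exp(-0.5 * x**2)   # unnormalized standard normal target
A = lambda x: x**2                  # function of interest; E_P[A] = 1

x, draws = 0.0, []
for t in range(50_000):
    x_new = x + rng.normal()                 # symmetric proposal
    if rng.uniform() <= f(x_new) / f(x):     # Metropolis acceptance
        x = x_new
    draws.append(x)

burn_in = 1_000                               # discard initial samples
print(np.mean([A(x) for x in draws[burn_in:]]))  # approximately 1
```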
Suppose that the most recent value sampled is x_t. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state x' with probability density g(x' | x_t) and calculate a value

a = a_1 a_2,

where

a_1 = P(x') / P(x_t)

is the probability (e.g., Bayesian posterior) ratio between the proposed sample x' and the previous sample x_t, and

a_2 = g(x_t | x') / g(x' | x_t)

is the ratio of the proposal density in the two directions (from x_t to x' and conversely). This is equal to 1 if the proposal density is symmetric. The new state x_{t+1} is then chosen according to the following rules.

If a ≥ 1: x_{t+1} = x';
else:
x_{t+1} = x' with probability a, or x_{t+1} = x_t with probability 1 − a.
The Markov chain is started from an arbitrary initial value x_0, and the algorithm is run for many iterations until this initial state is "forgotten". These initial samples, which are discarded, are known as burn-in. The remaining set of accepted values of x represents a sample from the distribution P(x).
The algorithm works best if the proposal density matches the shape of the target distribution P(x), from which direct sampling is difficult, that is g(x' | x_t) ≈ P(x'). If a Gaussian proposal density g is used, the variance parameter σ² has to be tuned during the burn-in period. This is usually done by calculating the acceptance rate, which is the fraction of proposed samples that is accepted in a window of the last N samples. The desired acceptance rate depends on the target distribution; however, it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian distribution is about 50%, decreasing to about 23% for an N-dimensional Gaussian target distribution.

If σ² is too small, the chain will mix slowly (i.e., the acceptance rate will be high, but successive samples will move around the space slowly, and the chain will converge only slowly to P(x)). On the other hand, if σ² is too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so a_1 will be very small, and again the chain will converge very slowly. One typically tunes the proposal distribution so that the algorithm accepts on the order of 30% of all samples, in line with the theoretical estimates mentioned in the previous paragraph.
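The following Python sketch illustrates one simple way to tune σ during burn-in by monitoring the acceptance rate over a window. The target rate of roughly 30% and the multiplicative adaptation rule are illustrative assumptions, not a prescribed method.

```python
# A minimal sketch of tuning the Gaussian proposal scale during burn-in
# by nudging sigma toward a target acceptance rate, then freezing it.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.exp(-0.5 * x**2)   # unnormalized target

x, sigma = 0.0, 5.0                 # deliberately poor initial scale
accepts, window = 0, 200
for t in range(1, 5_001):           # burn-in phase only
    x_new = x + sigma * rng.normal()
    if rng.uniform() <= f(x_new) / f(x):
        x = x_new
        accepts += 1
    if t % window == 0:             # adapt every `window` proposals
        rate = accepts / window
        sigma *= 1.1 if rate > 0.3 else 0.9  # nudge toward ~30% acceptance
        accepts = 0

print(sigma)  # tuned proposal scale, held fixed for the sampling phase
```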
See main article: Bayesian inference.

MCMC can be used to draw samples from the posterior distribution of a statistical model. The acceptance probability is given by

P_acc(θ_i → θ*) = min(1, [L(y | θ*) P(θ*)] / [L(y | θ_i) P(θ_i)]),

where L is the likelihood, P(θ) the prior probability density, and Q the (conditional) proposal density.
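As an illustration, the following Python sketch samples the posterior of the mean θ of Gaussian data with known unit variance. The synthetic data, the N(0, 10²) prior, and the symmetric proposal (for which the Q ratio cancels) are illustrative assumptions; log densities are used for numerical stability, so the acceptance test compares log(u) with the log of the ratio above.

```python
# A minimal sketch of Metropolis-Hastings for a Bayesian posterior:
# inferring the mean theta of Gaussian data with known unit variance.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=50)   # synthetic data with true mean 2

def log_posterior(theta):
    log_lik = -0.5 * np.sum((y - theta)**2)   # Gaussian log-likelihood
    log_prior = -0.5 * theta**2 / 10**2       # N(0, 10^2) log-prior
    return log_lik + log_prior

theta, chain = 0.0, []
for t in range(10_000):
    theta_new = theta + 0.5 * rng.normal()    # symmetric proposal: Q cancels
    # P_acc = min(1, L(y|theta*) P(theta*) / (L(y|theta_i) P(theta_i)))
    if np.log(rng.uniform()) <= log_posterior(theta_new) - log_posterior(theta):
        theta = theta_new
    chain.append(theta)

print(np.mean(chain[1_000:]))  # posterior mean, close to y.mean()
```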