Markov Chain Monte Carlo (MCMC)

a type of approximate probabilistic inference that uses dependent (correlated) sampling to approximate a complex target distribution 𝜋
a method for obtaining a sequence of random samples whose distribution converges to a target probability distribution from which direct sampling is difficult
based on the theory of Markov chains and Monte Carlo simulation

MCMC - Theory

Problem:

approximate or sample from target distribution 𝜋

Solution:

Markov Chain idea: given an ergodic transition matrix 𝑇 there exists a stationary distribution 𝜋
MCMC idea: given a target distribution 𝜋 construct an ergodic transition matrix 𝑇 that will produce 𝜋
- the ergodic theorem states that sampling from this Markov chain 𝑇 will approximate the target distribution 𝜋
  - 𝑙𝑖𝑚_𝑡→∞[𝜂(𝑖,𝑡)/𝑡] = 𝜋(𝑖)
  where:
  - 𝑡 - the number of steps
  - 𝑖 - a state in the Markov chain
  - 𝜂(𝑖,𝑡) - the number of visits to state 𝑖 over a period of 𝑡 steps
  - 𝜋(𝑖) > 0 - the stationary distribution value for state 𝑖

MCMC assumes there is some transition matrix 𝑇 that assigns a probability to going from one state to another

a necessary condition on 𝑇 is that there must exist a vector 𝜋 that satisfies the stationarity condition:

𝜋(𝑥_𝑗) = 𝛴_{𝑥_𝑖}𝜋(𝑥_𝑖)𝑇(𝑥_𝑖 → 𝑥_𝑗)
𝜋 = 𝜋𝑇 # matrix vector form

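example stationary distribution for a two-state chain with 𝑇(𝑥₀ → 𝑥₀) = 0.7, 𝑇(𝑥₀ → 𝑥₁) = 0.3, 𝑇(𝑥₁ → 𝑥₀) = 0.4, 𝑇(𝑥₁ → 𝑥₁) = 0.6 and 𝜋 ≈ (0.571, 0.428)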

𝜋(𝑥_𝑗) = 𝜋(𝑥₀)𝑇(𝑥₀ → 𝑥_𝑗) + 𝜋(𝑥₁)𝑇(𝑥₁ → 𝑥_𝑗)

for 𝑗=0:

𝜋(𝑥₀) ≈ 𝜋(𝑥₀)𝑇(𝑥₀ → 𝑥₀) + 𝜋(𝑥₁)𝑇(𝑥₁ → 𝑥₀)

0.571 ≈ 0.571 * 0.7 + 0.428 * 0.4

0.571 ≈ 0.5709

for 𝑗=1:

𝜋(𝑥₁) ≈ 𝜋(𝑥₀)𝑇(𝑥₀ → 𝑥₁) + 𝜋(𝑥₁)𝑇(𝑥₁ → 𝑥₁)

0.428 ≈ 0.571 * 0.3 + 0.428 * 0.6

0.428 ≈ 0.4281
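The arithmetic above can be checked with a minimal sketch. It assumes the two-state transition matrix implied by the worked numbers (0.7, 0.3, 0.4, 0.6); the helper names `two_state_stationary` and `apply_T` are illustrative and not from any library. For a two-state chain the stationary distribution has a closed form, and the exact values are 4/7 and 3/7, which the three-decimal figures in the example approximate.

```python
# Transition matrix implied by the worked example: rows are "from", columns are "to".
T = [[0.7, 0.3],
     [0.4, 0.6]]

def two_state_stationary(T):
    """Closed form for a two-state chain: pi = (b, a) / (a + b)."""
    a = T[0][1]  # probability of leaving state 0
    b = T[1][0]  # probability of leaving state 1
    return [b / (a + b), a / (a + b)]

def apply_T(pi, T):
    """Right-hand side of the stationarity condition: sum_i pi(x_i) T(x_i -> x_j)."""
    n = len(T)
    return [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]

pi = two_state_stationary(T)
pi_next = apply_T(pi, T)

# If pi is stationary, applying T leaves it unchanged (up to floating-point error).
print("pi      =", pi)
print("pi * T  =", pi_next)
```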

Suppose target distribution is a joint distribution over 𝑛 variables 𝐏(𝑋₁, …, 𝑋_𝑛)

In the Markov Chain state-space each state is a complete assignment 𝒙 to 𝑿 = {𝑋₁, …, 𝑋_𝑛}

we traverse the state-space 𝑡 times, thus getting 𝑡 samples {𝒙⁽¹⁾, …, 𝒙^(𝑡)}:

𝒙⁽¹⁾ ~ 𝑇(𝒙⁽⁰⁾ → 𝒙⁽¹⁾)
𝒙⁽²⁾ ~ 𝑇(𝒙⁽¹⁾ → 𝒙⁽²⁾)
…
𝒙^(𝑡) ~ 𝑇(𝒙^(𝑡-1) → 𝒙^(𝑡))

MCMC gives approximate, correlated samples, which can still be averaged to estimate expectations under 𝐏:

𝔼_𝐏[𝑓] ≈ (1/𝑡) 𝛴_{1≤𝑖≤𝑡} 𝑓(𝒙^(𝑖))
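A small simulation can illustrate this estimator. The sketch below assumes an illustrative two-state chain with transition probabilities 0.7/0.3 and 0.4/0.6, whose stationary probability of state 0 is 4/7; the function names and the choice of 𝑓 (an indicator of state 0) are hypothetical. With an indicator as 𝑓, the average is exactly the visit frequency 𝜂(𝑖,𝑡)/𝑡 from the ergodic theorem.

```python
import random

# Illustrative two-state chain (an assumption for this sketch).
T = [[0.7, 0.3],
     [0.4, 0.6]]

def sample_chain(T, x0, t, rng):
    """Draw x(1), ..., x(t), each sampled from T(x(i-1) -> .)."""
    x = x0
    samples = []
    for _ in range(t):
        x = rng.choices(range(len(T)), weights=T[x])[0]
        samples.append(x)
    return samples

def ergodic_average(samples, f):
    """(1/t) * sum_i f(x(i)): the MCMC estimate of E[f]."""
    return sum(f(x) for x in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = sample_chain(T, x0=0, t=100_000, rng=rng)

# f is the indicator of state 0, so E[f] is the stationary probability of state 0.
est = ergodic_average(samples, lambda x: 1.0 if x == 0 else 0.0)
print("estimate:", est, "exact:", 4 / 7)
```

Consecutive samples are correlated, so the estimate is noisier than one built from the same number of independent draws, but it still converges as 𝑡 grows.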

at a high level, the Markov Chain is defined in terms of a graph of states over which the sampling algorithm takes a random walk. In the case of graphical models, this graph is not the original graph, but rather a graph whose nodes are the possible assignments to our variables {𝑋₁, …, 𝑋_𝑛}. A transition model 𝑇 specifies for each pair of states (𝒙, 𝒙’) the probability 𝑇(𝒙 → 𝒙’) of going from state 𝒙 to 𝒙’.

this defines a random sequence of states: {𝒙⁽¹⁾, …, 𝒙^(𝑡)}. Using chain dynamics we can define distributions over subsequent states:

𝐏^(𝑖+1)(𝑋^(𝑖+1)=𝒙’) = 𝛴_𝒙[𝐏^(𝑖)(𝑋^(𝑖)=𝒙)𝑇(𝒙 → 𝒙’)]

intuitively, the probability of being in state 𝒙’ at time 𝑖+1 is a sum over all states 𝒙 the chain could be in at time 𝑖: the probability of being in 𝒙 times the probability of taking a transition from 𝒙 to 𝒙’

as the process converges, we would expect 𝐏^(𝑖+1) to be close to 𝐏^(𝑖):

𝐏^(𝑖)(𝒙’) ≈ 𝐏^(𝑖+1)(𝒙’) = 𝛴_𝒙[𝐏^(𝑖)(𝒙)𝑇(𝒙 → 𝒙’)]

at convergence, we expect the resulting distribution 𝜋(𝑿) to be an equilibrium relative to the transition model (i.e. the probability of being in a state is the same as the probability of transitioning into it from a randomly sampled predecessor)

𝜋(𝑿=𝒙’) = 𝛴_𝒙𝜋(𝑿=𝒙)𝑇(𝒙 → 𝒙’)

𝜋(𝑿) is a stationary distribution for a Markov Chain 𝑇
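The convergence of 𝐏^(𝑖) can be sketched numerically. The example assumes an illustrative two-state chain (transition probabilities 0.7/0.3 and 0.4/0.6) and starts from a point mass on state 0; `propagate` and `total_variation` are hypothetical helper names. It applies the chain dynamics repeatedly and records how far apart successive distributions are.

```python
# Illustrative two-state chain (an assumption for this sketch).
T = [[0.7, 0.3],
     [0.4, 0.6]]

def propagate(p, T):
    """Chain dynamics: P_next(x') = sum_x P(x) * T(x -> x')."""
    n = len(T)
    return [sum(p[x] * T[x][x2] for x in range(n)) for x2 in range(n)]

def total_variation(p, q):
    """Half the L1 distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

p = [1.0, 0.0]  # start with all probability mass on state 0
gaps = []
for i in range(20):
    p_next = propagate(p, T)
    gaps.append(total_variation(p, p_next))  # distance between P^(i) and P^(i+1)
    p = p_next

print("distribution after 20 steps:", p)
print("first gaps between successive distributions:", gaps[:4])
```

For this chain the gap shrinks by a constant factor at every step, and the distribution settles at the point where applying 𝑇 no longer changes it.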

MCMC - Transition Models

we often define a set of transition models {𝑇₁, …, 𝑇_𝑘} where each transition model 𝑇_𝑖 is called a kernel.

In certain cases, having multiple kernels:

- is necessary, because no single kernel on its own suffices to ensure regularity
- makes the state-space more connected and therefore speeds up the convergence to a stationary distribution

Methods in Constructing a Markov Chain From Multiple Transition Models

- select a random 𝑇_𝑖 from {𝑇₁, …, 𝑇_𝑘} at each step
- cycle through the transition models in a fixed order, applying each one in turn

in the case of graphical models, we could define a multi-kernel chain with one kernel 𝑇_𝑖 for each variable/node. 𝑇_𝑖 takes a state (𝒙_-𝑖, 𝑥_𝑖) and transitions to a state of the form (𝒙_-𝑖, 𝑥_𝑖’), changing only the value of 𝑋_𝑖. When 𝑥_𝑖’ is sampled from 𝐏(𝑋_𝑖|𝒙_-𝑖), this method is called Gibbs Sampling
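Both ways of combining kernels can be sketched on a toy model. The joint table `P` over two binary variables is hypothetical, as are the function names; each kernel resamples one variable from its conditional given the other, so neither kernel alone can reach every state, while either combination can.

```python
import random

# Hypothetical joint distribution over two binary variables (X1, X2).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def gibbs_kernel(i):
    """Kernel T_i: resample variable i from P(X_i | x_-i), keep the others fixed."""
    def kernel(x, rng):
        candidates = []
        for v in (0, 1):
            y = list(x)
            y[i] = v
            candidates.append(tuple(y))
        # Joint probabilities with x_-i fixed are proportional to P(X_i = v | x_-i).
        weights = [P[c] for c in candidates]
        return rng.choices(candidates, weights=weights)[0]
    return kernel

kernels = [gibbs_kernel(0), gibbs_kernel(1)]

def random_scan_step(x, rng):
    """Pick one kernel uniformly at random and apply it."""
    return rng.choice(kernels)(x, rng)

def systematic_step(x, rng):
    """Apply every kernel once, in a fixed order."""
    for k in kernels:
        x = k(x, rng)
    return x

def run(step, t, rng, x0=(0, 0)):
    """Run the chain for t steps and return the visit frequency of each state."""
    x = x0
    counts = {s: 0 for s in P}
    for _ in range(t):
        x = step(x, rng)
        counts[x] += 1
    return {s: c / t for s, c in counts.items()}

freq_random = run(random_scan_step, 50_000, random.Random(1))
freq_cycle = run(systematic_step, 50_000, random.Random(2))
print("random selection:", freq_random)
print("fixed cycle:     ", freq_cycle)
```

In both runs the visit frequencies should land close to the entries of `P`, up to sampling noise.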

MCMC - Implementations

Suppose target distribution is 𝐏(𝑋₁, …, 𝑋_𝑛)

Gibbs Sampling

Transition function: based on sampling from 𝐏(𝑋_𝑖|𝒙_-𝑖) for each 𝑥_𝑖 in turn

𝐏(𝒙) = 𝐏(𝑥_𝑖|𝒙_-𝑖)𝐏(𝒙_-𝑖)
where:
- 𝒙 - all the variables
- 𝑥_𝑖 - a single variable
- 𝒙_-𝑖 - all variables in 𝒙 minus 𝑥_𝑖

for Systematic Gibbs Sampling, we sample from 𝐏(𝑋_𝑖|𝒙_-𝑖) for every 𝑥_𝑖 in 𝒙, one after another. After one full pass we have a new sample {𝑥₁, …, 𝑥_𝑛}:

𝑥₁^{𝑛𝑒𝑥𝑡} ~ 𝐏(𝑋₁|𝑥₂^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡}, …, 𝑥_𝑛^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡})
𝑥₂^{𝑛𝑒𝑥𝑡} ~ 𝐏(𝑋₂|𝑥₁^{𝑛𝑒𝑥𝑡}, 𝑥₃^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡}, …, 𝑥_𝑛^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡})
…
𝑥_𝑛^{𝑛𝑒𝑥𝑡} ~ 𝐏(𝑋_𝑛|𝑥₁^{𝑛𝑒𝑥𝑡}, …, 𝑥_{𝑛-1}^{𝑛𝑒𝑥𝑡})

Therefore the transition function is the probability of producing 𝒙^{𝑛𝑒𝑥𝑡} from 𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡} through these draws (the product of the conditionals above):

𝑇(𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡} → 𝒙^{𝑛𝑒𝑥𝑡}) = 𝐏(𝒙^{𝑛𝑒𝑥𝑡}|𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡})

Systematic Gibbs Sampling satisfies the stationarity condition with 𝜋 = 𝐏:

𝜋(𝒙^{𝑛𝑒𝑥𝑡}) = 𝛴_{𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡} ∊ 𝐴𝐿𝐿-𝑆𝑇𝐴𝑇𝐸𝑆} 𝜋(𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡})𝑇(𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡} → 𝒙^{𝑛𝑒𝑥𝑡})
𝐏(𝒙^{𝑛𝑒𝑥𝑡}) = 𝛴_{𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡} ∊ 𝐴𝐿𝐿-𝑆𝑇𝐴𝑇𝐸𝑆} 𝐏(𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡})𝑇(𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡} → 𝒙^{𝑛𝑒𝑥𝑡})
𝐏(𝒙^{𝑛𝑒𝑥𝑡}) = 𝛴_{𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡} ∊ 𝐴𝐿𝐿-𝑆𝑇𝐴𝑇𝐸𝑆} 𝐏(𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡})𝐏(𝒙^{𝑛𝑒𝑥𝑡}|𝒙^{𝑐𝑢𝑟𝑟𝑒𝑛𝑡})

Properties:
- more efficient per step than the Metropolis-Hastings (MH) Algorithm, as it accepts all proposals
- however, it takes local steps that stay very near the current state, as compared to MH
- requires us to know and sample from the conditional distribution 𝐏(𝑋_𝑖|𝒙_-𝑖)
- slow when compared to MH when a subset of variables in {𝑋₁, …, 𝑋_𝑛} are correlated

Metropolis-Hastings (MH) Algorithm

Transition function: based on sampling from a proposal distribution 𝐐(𝒙 → 𝒙’) and accepting these proposals with probability 𝐀(𝒙 → 𝒙’)

algorithm, at each state 𝒙:
- sample next state 𝒙’ from 𝐐(𝒙 → 𝒙’)
- accept next state 𝒙’ with probability 𝐀(𝒙 → 𝒙’)
- if accepted, move to 𝒙’
- otherwise stay at 𝒙

Therefore the transition function 𝑇 is:

if 𝒙 ≠ 𝒙’:
𝑇(𝒙 → 𝒙’) = 𝐐(𝒙 → 𝒙’)𝐀(𝒙 → 𝒙’)

if 𝒙 = 𝒙’:
𝑇(𝒙 → 𝒙) = 𝐐(𝒙 → 𝒙) + 𝛴_{𝒙’≠𝒙} [𝐐(𝒙 → 𝒙’)(1 - 𝐀(𝒙 → 𝒙’))]

construct 𝐀 such that detailed balance holds for 𝐐 (summing the first equation over 𝒙 recovers the stationarity condition for 𝜋):

𝜋(𝒙’)𝑇(𝒙’ → 𝒙) = 𝜋(𝒙)𝑇(𝒙 → 𝒙’)
𝜋(𝒙’)𝐐(𝒙’ → 𝒙)𝐀(𝒙’ → 𝒙) = 𝜋(𝒙)𝐐(𝒙 → 𝒙’)𝐀(𝒙 → 𝒙’)
𝐀(𝒙 → 𝒙’) / 𝐀(𝒙’ → 𝒙) = [𝜋(𝒙’)𝐐(𝒙’ → 𝒙)] / [𝜋(𝒙)𝐐(𝒙 → 𝒙’)]
𝐀(𝒙 → 𝒙’) = 𝑚𝑖𝑛(1, [𝜋(𝒙’)𝐐(𝒙’ → 𝒙)] / [𝜋(𝒙)𝐐(𝒙 → 𝒙’)])

Properties:
- takes steps based on the proposal distribution 𝐐(𝒙 → 𝒙’), allowing MH to take larger steps in the state space when compared to Gibbs Sampling
- however, it is less efficient per step than Gibbs Sampling, as some proposals are rejected and the chain stays where it is
- does NOT require us to know and sample from the conditional distribution 𝐏(𝑋_𝑖|𝒙_-𝑖); only the ratio 𝜋(𝒙’)/𝜋(𝒙) appears in 𝐀

Hamiltonian Monte Carlo

TODO
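The Metropolis-Hastings steps listed above (propose from 𝐐, accept with probability 𝐀, otherwise stay) can be sketched for a small discrete target. Everything specific here is an assumption made for illustration: the unnormalised `weights`, the ring-shaped state space, the symmetric random-walk proposal, and the function names.

```python
import random

# Hypothetical unnormalised target over states 0..4; only ratios pi(x')/pi(x) are used.
weights = [1.0, 2.0, 4.0, 2.0, 1.0]
n = len(weights)

def propose(x, rng):
    """Sample x' from Q(x -> x'): step left or right on a ring with probability 1/2."""
    return (x + rng.choice((-1, 1))) % n

def q(x, x_new):
    """Proposal probability Q(x -> x_new) for the ring random walk."""
    return 0.5 if (x_new - x) % n in (1, n - 1) else 0.0

def accept_prob(x, x_new):
    """A(x -> x') = min(1, [pi(x') Q(x' -> x)] / [pi(x) Q(x -> x')])."""
    return min(1.0, (weights[x_new] * q(x_new, x)) / (weights[x] * q(x, x_new)))

def mh_chain(x0, t, rng):
    x = x0
    samples = []
    for _ in range(t):
        x_new = propose(x, rng)
        if rng.random() < accept_prob(x, x_new):
            x = x_new          # accepted: move to x'
        samples.append(x)      # rejected: the current state is recorded again
    return samples

samples = mh_chain(0, 100_000, random.Random(0))
total = sum(weights)
target = [w / total for w in weights]
freq = [samples.count(s) / len(samples) for s in range(n)]
print("target:   ", target)
print("estimated:", freq)
```

Because this proposal is symmetric, the 𝐐 terms cancel in the acceptance ratio; they are kept in `accept_prob` so the code mirrors the general formula.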

MCMC - Examples

Simple Example

Subpages

Importance Sampling (IS) vs Monte Carlo Markov Chains (MCMC)

Resources

https://statswithr.github.io/book/stochastic-explorations-using-mcmc.html

／var／log marcus chiu
