Markov Chain Monte Carlo (MCMC)
  • a type of approximate probabilistic inference that uses dependent sampling as a way to approximate the distribution of a complex target distribution 𝜋
  • is a method for obtaining a sequence of random samples which converge to being distributed according to a target probability distribution for which direct sampling is difficult
  • based on the theory of Markov Chains and Simulations

MCMC - Theory

Problem:

  • approximate or sample from target distribution 𝜋

Solution:

  • Markov Chain idea: given an ergodic transition matrix 𝑇 there exists a stationary distribution 𝜋
  • MCMC idea: given a target distribution 𝜋 construct an ergodic transition matrix 𝑇 that will produce 𝜋
    • the ergodic theorem states that sampling from this Markov chain 𝑇 will approximate the target distribution 𝜋

MCMC assumes there is some transition matrix 𝑇 that assigns probabilities from going from one state to another

a necessary condition of 𝑇 is that there must exist a vector 𝜋 that satisfies stationary distribution:

  • 𝜋(𝑥𝑗) = 𝛴𝑥𝑖𝜋(𝑥𝑖)𝑇(𝑥𝑖𝑥𝑗)
  • 𝜋 = 𝜋𝑇 # matrix vector form

Suppose target distribution is a joint distribution over 𝑛 variables 𝐏(𝑋1, …, 𝑋𝑛)

In the Markov Chain state-space each state is a complete assignment 𝒙 to 𝑿 = {𝑋1, …, 𝑋𝑛}

we traverse the state-space 𝑡 times, thus getting 𝑡 samples {𝒙(1), …, 𝒙(𝑡)}:

  • 𝒙(1) ~ 𝑇(𝒙(0) → 𝒙(1))
  • 𝒙(2) ~ 𝑇(𝒙(1) → 𝒙(2))
  • 𝒙(𝑡) ~ 𝑇(𝒙(𝑡-1) → 𝒙(𝑡))

MCMC gives approximate correlated samples:

  • 𝔼𝐏[𝑓] ≈ (1/𝑡) 𝛴1≤𝑖≤𝑡 𝑓(𝒙(𝑖))

at a high level, the Markov Chain is defined in terms of a graph of states over which the sampling algorithm takes a random walk. In the case of graphical models, this graph is not the original graph, but rather whose nodes are possible are the possible assignments to our variables {𝑋1, …, 𝑋𝑛}. A transition model 𝑇 specifies for each pair of states (𝒙, 𝒙’) the probability 𝑇(𝒙 → 𝒙’) of going from state 𝒙 to 𝒙’.

this defines a random sequence of states: {𝒙(1), …, 𝒙(𝑡)}. Using chain dynamics we can define distributions over subsequent states:

  • 𝐏(𝑖+1)(𝑋(𝑖+1)=𝒙’) = 𝛴𝒙[𝐏(𝑖)(𝑋(𝑖)=𝒙)𝑇(𝒙 → 𝒙’)]

intuitively, the probability of being at state 𝒙’ at time 𝑖+1 is the sum over all possible states 𝒙 that the chain could be at time 𝑖 of the probability of being in state 𝒙 times the probability that the chain took a transition from 𝒙 to 𝒙’

as the process converges, we would expect 𝐏(𝑖+1) to be close to 𝐏(𝑖):

  • 𝐏(𝑖)(𝒙’) = 𝐏(𝑖+1)(𝒙’) = 𝛴𝒙[𝐏(𝑖)(𝒙)𝑇(𝒙 → 𝒙’)]

at convergence, we expect the resulting distribution 𝜋(𝑿) to be an equilibrium relative to the transition model (i.e. the probability of being in a state is the same as the probability of transitioning into it from a randomly sampled predecessor)

  • 𝜋(𝑿=𝒙’) = 𝛴𝒙𝜋(𝑿=𝒙)𝑇(𝒙 → 𝒙’)

𝜋(𝑿) is a stationary distribution for a Markov Chain 𝑇

MCMC - Transition Models

MCMC - Implementations

Suppose target distribution is 𝐏(𝑋1, …, 𝑋𝑛)

Implementation

Transition Function

Gibbs Sampling

based on sampling from 𝐏(𝑋𝑖|𝒙-𝑖) for all 𝑥𝑖

Metropolis-Hastings (MH) Algorithm

based on sampling from proposal distribution 𝐐(𝒙 → 𝒙’) and accepting these proposals with probability 𝐀(𝒙 → 𝒙’)

  • takes steps based on proposal distribution 𝐐(𝒙 → 𝒙’), allowing MH to take larger steps in the state space when compared to Gibbs Sampling.
  • However, it is less efficient than Gibbs Sampling as it rejects/accepts proposal steps
  • does NOT require us to know and sample from the conditional distribution 𝐏(𝑋𝑖|𝒙-𝑖))

Hamiltonian Monte Carlo

TODO

MCMC - Examples

Subpages

Resources