Markov Chain Monte Carlo (MCMC)
- a type of approximate probabilistic inference that uses dependent sampling as a way to approximate the distribution of a complex target distribution 𝜋
- is a method for obtaining a sequence of random samples which converge to being distributed according to a target probability distribution for which direct sampling is difficult
- based on the theory of Markov Chains and Simulations
MCMC - Theory
Problem:
- approximate or sample from target distribution 𝜋
Solution:
- Markov Chain idea: given an ergodic transition matrix 𝑇 there exists a stationary distribution 𝜋
- MCMC idea: given a target distribution 𝜋 construct an ergodic transition matrix 𝑇 that will produce 𝜋
- the ergodic theorem states that sampling from this Markov chain 𝑇 will approximate the target distribution 𝜋
-
Click here to expand...
- 𝑙𝑖𝑚𝑡→∞[𝜂(𝑖,𝑡)/𝑡] = 𝜋(𝑖)
where:
- 𝑡 - the number of steps
- 𝑖 - a state in the Markov chain
- 𝜂(𝑖,𝑡) - the number of visits to state 𝑖 over a period of 𝑡 steps
- 𝜋(𝑖) > 0 - the stationary distribution value for state 𝑖
MCMC assumes there is some transition matrix 𝑇 that assigns probabilities from going from one state to another
a necessary condition of 𝑇 is that there must exist a vector 𝜋 that satisfies stationary distribution:
- 𝜋(𝑥𝑗) = 𝛴𝑥𝑖𝜋(𝑥𝑖)𝑇(𝑥𝑖 → 𝑥𝑗)
- 𝜋 = 𝜋𝑇 # matrix vector form
example stationary distribution
- 𝜋(𝑥𝑗) = 𝜋(𝑥0)𝑇(𝑥0 → 𝑥𝑗) + 𝜋(𝑥1)𝑇(𝑥1 → 𝑥𝑗)
for 𝑗=0:
- 𝜋(𝑥0) ≈ 𝜋(𝑥0)𝑇(𝑥0 → 𝑥0) + 𝜋(𝑥1)𝑇(𝑥1 → 𝑥0)
- 0.571 ≈ 0.571 * 0.7 + 0.428 * 0.4
- 0.571 ≈ 0.5709
for 𝑗=1:
- 𝜋(𝑥1) ≈ 𝜋(𝑥0)𝑇(𝑥0 → 𝑥1) + 𝜋(𝑥1)𝑇(𝑥1 → 𝑥1)
- 0.428 ≈ 0.571 * 0.3 + 0.428 * 0.6
- 0.428 ≈ 0.4281
Suppose target distribution is a joint distribution over 𝑛 variables 𝐏(𝑋1, …, 𝑋𝑛)
In the Markov Chain state-space each state is a complete assignment 𝒙 to 𝑿 = {𝑋1, …, 𝑋𝑛}
we traverse the state-space 𝑡 times, thus getting 𝑡 samples {𝒙(1), …, 𝒙(𝑡)}:
- 𝒙(1) ~ 𝑇(𝒙(0) → 𝒙(1))
- 𝒙(2) ~ 𝑇(𝒙(1) → 𝒙(2))
- …
- 𝒙(𝑡) ~ 𝑇(𝒙(𝑡-1) → 𝒙(𝑡))
MCMC gives approximate correlated samples:
- 𝔼𝐏[𝑓] ≈ (1/𝑡) 𝛴1≤𝑖≤𝑡 𝑓(𝒙(𝑖))
at a high level, the Markov Chain is defined in terms of a graph of states over which the sampling algorithm takes a random walk. In the case of graphical models, this graph is not the original graph, but rather whose nodes are possible are the possible assignments to our variables {𝑋1, …, 𝑋𝑛}. A transition model 𝑇 specifies for each pair of states (𝒙, 𝒙’) the probability 𝑇(𝒙 → 𝒙’) of going from state 𝒙 to 𝒙’.
this defines a random sequence of states: {𝒙(1), …, 𝒙(𝑡)}. Using chain dynamics we can define distributions over subsequent states:
- 𝐏(𝑖+1)(𝑋(𝑖+1)=𝒙’) = 𝛴𝒙[𝐏(𝑖)(𝑋(𝑖)=𝒙)𝑇(𝒙 → 𝒙’)]
intuitively, the probability of being at state 𝒙’ at time 𝑖+1 is the sum over all possible states 𝒙 that the chain could be at time 𝑖 of the probability of being in state 𝒙 times the probability that the chain took a transition from 𝒙 to 𝒙’
as the process converges, we would expect 𝐏(𝑖+1) to be close to 𝐏(𝑖):
- 𝐏(𝑖)(𝒙’) = 𝐏(𝑖+1)(𝒙’) = 𝛴𝒙[𝐏(𝑖)(𝒙)𝑇(𝒙 → 𝒙’)]
at convergence, we expect the resulting distribution 𝜋(𝑿) to be an equilibrium relative to the transition model (i.e. the probability of being in a state is the same as the probability of transitioning into it from a randomly sampled predecessor)
- 𝜋(𝑿=𝒙’) = 𝛴𝒙𝜋(𝑿=𝒙)𝑇(𝒙 → 𝒙’)
𝜋(𝑿) is a stationary distribution for a Markov Chain 𝑇
MCMC - Transition Models
Click here to expand...
Link to originalwe often define a set of transition models {𝑇1, 𝑇𝑘} where each transition model 𝑇𝑖 is called a kernel.
In certain cases, the variety or multiplicity of kernels:
- is necessary, because no single kernel on its own suffices to ensure regularity
- makes the state-space more connected and therefore speeds up the convergence to a stationary distribution
Methods in Constructing a Markov Chain From Multiple Transition Models
- simply select a random 𝑇𝑖 from {𝑇1, 𝑇𝑘} at each step
- simply cycle through over different transition models at each step
in the case of graphical models, we could define a multi-kernel chain, where we have a kernel 𝑇𝑖 for each variable/node. Thus 𝑇𝑖 takes state (𝒙-𝑖, 𝑥𝑖) and transitions to a state of the form (𝒙-𝑖, 𝑥𝑖’). This method is called Gibbs Sampling
MCMC - Implementations
Suppose target distribution is 𝐏(𝑋1, …, 𝑋𝑛)
|
Implementation |
Transition Function |
|---|---|
|
based on sampling from 𝐏(𝑋𝑖|𝒙-𝑖) for all 𝑥𝑖
| |
| |
|
based on sampling from proposal distribution 𝐐(𝒙 → 𝒙’) and accepting these proposals with probability 𝐀(𝒙 → 𝒙’)
| |
| |
|
TODO |
MCMC - Examples
Simple Example
-methods)-/markov-chain-monte-carlo-(mcmc)/stationary-distribution-example.png)
-methods)-/markov-chain-monte-carlo-(mcmc)/mcmc-simple-example.png)