probabilistic inference recap

given a probabilistic model, Probabilistic Inference is answering a probabilistic query over that model. It is also used for the computation/estimation of: distributions, the distribution’s parameters (if it’s a parametric distribution), the distribution’s probabilities, and/or the distribution’s characteristics

syntax definition

Click here to expand...

A graphical model 𝒢 is a tuple 𝒢 = ⟨𝐗, 𝐃, 𝐒, 𝐅, 𝐂⟩ where:

𝐗 = {𝑋₁, …, 𝑋_𝑛} set of ordered variables

𝐃 = {𝐷₁, …, 𝐷_𝑛} set of corresponding domains of each variable 𝑋_𝑖 (e.g. if 𝑋₁is a boolean variable then 𝐷₁= {true, false}). The size of each 𝐷_𝑖corresponds to the cardinality of variable 𝑋_𝑖

𝐒 = {𝑆₁, …, 𝑆_𝑚} set of variable scopes, where each variable scope 𝑆_𝑖is a subset of 𝐗 (i.e. 𝑆_𝑖 ⊆ 𝐗)

𝐅 = {𝐹₁, …, 𝐹_𝑚} set of factors/functions, where each factor/function 𝐹_𝑖 is defined over its corresponding variable scope 𝑆_𝑖and maps any assignment over its scope to a real value

in context of Bayesian Networks, 𝐅 = set of conditional probability tables

in context of Markov Networks, 𝐅 = set of factors

global function - is a function whose scope includes all variables (i.e. 𝑆_𝑖 = 𝐗)

local functions - is a function whose scope is a proper subset of variables (i.e. 𝑆_𝑖⊂ 𝐗)

𝐂 is a set of combination operators which defines how functions are combined. common combination operators are:

summation operator (𝛴)

multiplication operator (𝛱)

AND operator (∧) - for Boolean functions

relational join operator (⨝) - when the functions are relation

marginalization operator - for reasoning queries

max operator - e.g. = argmax_𝑦[ 𝐹_𝑖(𝑥,𝑦) ] = 𝐹_𝑗(𝑥) where 𝐹_𝑗is a new function with scope over variable 𝑥

the set of local functions can be combined in a variety of ways (e.g. combination operators) to generate a new local function or even a global function

Link to original

and let:

𝐐 = 𝑄₁, …, 𝑄_𝑟be the set of query variables

𝐄 = 𝐸₁, …, 𝐸_𝑠be the set of evidence variables

𝐇 = 𝐻₁, …, 𝐻_𝑡be the remainder of the variables(called the hidden variables (non-evidence, non-query variables))

𝐪 = 𝑞₁, …, 𝑞_𝑟be a grounded instantiation of 𝐐

𝐞 = 𝑒₁, …, 𝑒_𝑠be a grounded instantiation of 𝐄

𝐡 = 𝘩₁, …, 𝘩_𝑡be a grounded instantiation of 𝐇

𝑋, 𝐸, and 𝑌 are disjoint exhaustive sets of all variables 𝐗

therefore:

the complete set of variables is 𝐗 = 𝐐 ∪ 𝐄 ∪ 𝐇 = {𝑋₁, …, 𝑋_𝑛} = {𝑄₁, …, 𝑄_𝑟, 𝐸₁, …, 𝐸_𝑠, 𝐻₁, …, 𝐻_𝑡}

the complete set of instantiations is 𝒙 = 𝐪 ∪ 𝐞 ∪ 𝐡 = {𝑥₁, …, 𝑥_𝑛} = {𝑞₁, …, 𝑞_𝑟, 𝑒₁, …, 𝑒_𝑠, ℎ₁, …, ℎ_𝑡}

Link to original

representations

Click here to expand...

Parametric Probability Distribution Models - function/model based

Non-Parametric Probability Distribution Models - instance-based

see also: ML - Parametric vs Non-Parametric

Univariate Probability Distribution Models

Continuous Models (Probability Density Functions)

Discrete Models (Probability Mass Functions)

Multivariate Probability Distribution Models

Probabilistic Graphical Models (PGM)

Multi-Model Probability Distribution

representations of global structures

Probabilistic Graphical Models (PGM)

representations of local structures

representations of normalized distributions:

Conditional Probability Distributions (CPD)

see: CPD Representations

representations of un-normalized distributions:

Potential Functions

Link to original

probability query types

Click here to expand...

product comes from a joint probability 𝐏(𝐐=𝐪, 𝐇=𝐡, 𝐄=𝐞) being factorized into a product of conditional probabilities

when we take the log of products it becomes sum of factors

Query

Product
Sum

Marginalization𝛴

Maximization𝑎𝑟𝑔𝑚𝑎𝑥

Description

Posterior Conditional Query𝐏(𝐐=𝐪|𝐄=𝐞) = 𝛴_𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡, 𝐄=𝐞) ] / 𝐏(𝐄=𝐞)

Belief Updating Query
(a type of posterior conditional query)𝐏(𝑄_𝑖=𝑞_𝑖|𝐄=𝐞) = 𝛴_𝐡∊𝐇[ 𝐏(𝑄_𝑖=𝑞_𝑖, 𝐇=𝐡, 𝐄=𝐞) ] / 𝐏(𝐄=𝐞)

✔

✔

✘

sum-product problem

sum-log-sums problem

Prior Marginal Query𝐏(𝐐=𝐪) = 𝛴_𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡) ]

Probability of Evidence Query𝐏(𝐄=𝐞) = 𝛴_𝐡∊𝐇[ 𝐏(𝐇=𝐡, 𝐄=𝐞) ]

✔

✔

✘

sum-product problem

sum-log-sums problem

both queries essentially the same where 𝐐 = 𝐄

#P-Hard Problem

Maximum a Posterior (𝑴𝑨𝑷) Query𝑀𝐴𝑃(𝐐=?|𝐄=𝐞) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝐪[ 𝛴_𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡, 𝐄=𝐞) ] ]
where: 𝐐∪𝐄⊂𝐗 and 𝐐∪𝐇∪𝐄=𝐗

✔

✔

✔

max-sum-product problem

max-sum-log-sums problem

most probable assignment because they include maximization

NP^PP-Hard Problem

Most Probable Explanation (𝑴𝑷𝑬) Query𝑀𝑃𝐸(𝐐=?|𝐄=𝐞) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝐪[ 𝐏(𝐐=𝐪, 𝐄=𝐞) ]
where: 𝐐∪𝐄=𝐗

✔

✘

✔

max-product problem

max-log-sums problem

most probable assignment because they include maximization

NP-Hard Problem

more details: Probabilistic Inference - Query／Task Types (Posterior Conditional - Prior Marginal ／ Probability of Evidence - MPE - MAP)

Link to original

resources:

Click here to expand...

Daphne Koller’s Video Lectures

Link to original

Query	Product Sum	Marginalization𝛴	Maximization𝑎𝑟𝑔𝑚𝑎𝑥	Description
Posterior Conditional Query𝐏(𝐐=𝐪\|𝐄=𝐞) = 𝛴_𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡, 𝐄=𝐞) ] / 𝐏(𝐄=𝐞) Belief Updating Query (a type of posterior conditional query)𝐏(𝑄_𝑖=𝑞_𝑖\|𝐄=𝐞) = 𝛴_𝐡∊𝐇[ 𝐏(𝑄_𝑖=𝑞_𝑖, 𝐇=𝐡, 𝐄=𝐞) ] / 𝐏(𝐄=𝐞)	✔	✔	✘	sum-product problem sum-log-sums problem
Prior Marginal Query𝐏(𝐐=𝐪) = 𝛴_𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡) ] Probability of Evidence Query𝐏(𝐄=𝐞) = 𝛴_𝐡∊𝐇[ 𝐏(𝐇=𝐡, 𝐄=𝐞) ]	✔	✔	✘	sum-product problem sum-log-sums problem both queries essentially the same where 𝐐 = 𝐄 #P-Hard Problem
Maximum a Posterior (𝑴𝑨𝑷) Query𝑀𝐴𝑃(𝐐=?\|𝐄=𝐞) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝐪[ 𝛴_𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡, 𝐄=𝐞) ] ] where: 𝐐∪𝐄⊂𝐗 and 𝐐∪𝐇∪𝐄=𝐗	✔	✔	✔	max-sum-product problem max-sum-log-sums problem most probable assignment because they include maximization NP^PP-Hard Problem
Most Probable Explanation (𝑴𝑷𝑬) Query𝑀𝑃𝐸(𝐐=?\|𝐄=𝐞) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝐪[ 𝐏(𝐐=𝐪, 𝐄=𝐞) ] where: 𝐐∪𝐄=𝐗	✔	✘	✔	max-product problem max-log-sums problem most probable assignment because they include maximization NP-Hard Problem

Bucket Tree Elimination (BTE) - Collect & Distribute Algorithm

is an extension of a specific VE Algorithm called 𝐵𝐸-𝑏𝑒𝑙 algorithm, where a node is chosen as root, messages start at leaves passing it up to the root, and then messages are sent back down to leaves
since BE/VE Algorithm induces a cluster tree on which BE/VE Algorithm is performed, BTE is also a type of belief propagation algorithm

The 𝐵𝐸-𝑏𝑒𝑙 algorithm is designed to compute the probability of 𝐏(𝑄_𝑖|𝐄=𝑒), in other words, the belief of 𝑄_𝑖. It is often desirable to also compute the belief query for each and every variable in the network: 𝐏(𝑄₁|𝐄=𝑒), …, 𝐏(𝑄_𝑖-1|𝐄=𝑒), 𝐏(𝑄_𝑖+1|𝐄=𝑒), …, 𝐏(𝑄_𝑛|𝐄=𝑒). A brute-force approach will require running 𝐵𝐸-𝑏𝑒𝑙 algorithm 𝑛 times, one for each variable. We will show next that this is unnecessary. By viewing bucket-elimination as a message passing algorithm along a rooted bucket tree, we can augment it with a second message passing phase in the opposite direction, from root to leaves, and thus every belief query would be computed

From Bucket Elimination to Bucket Tree Elimination

Click here to expand...

consider the Bayesian Network below

Compute Belief in Variable 𝐴

we would like to compute the probabilistic query 𝐏(𝐴|𝐄=𝐞) where:

𝐴 is a single variable denoting the season

𝐄=𝐞 is a set of observed variables assigned with values. we call 𝐄=𝐞 evidence

Figure a shows the process of 𝐵𝐸-𝑏𝑒𝑙 with:

buckets with ordering 𝑑 = {𝐴, 𝐵, 𝐶, 𝐷, 𝐹, 𝐺}

𝜆 messages passed by 𝐵𝐸 from top to bottom

Figure b depicts the same computation as message-passing along a tree which we will refer to as a bucket tree

at the termination of 𝐵𝐸-𝑏𝑒𝑙 both 𝐏(𝐴, 𝐄=𝐞) and 𝐏(𝐄=𝐞) are computed, thus belief in 𝐴 given 𝐄=𝐞 is known via Bayes Rule. 𝐏(𝐴 | 𝐄=𝐞) = 𝐏(𝐴, 𝐄=𝐞) / 𝐏(𝐄=𝐞)

Compute Belief in Variable 𝐷

Now, what if we want to compute 𝐏(𝐷 | 𝐄=𝐞)? We could restart the algorithm using some new arbitrary ordering whose first variable is 𝐷, such as {𝐷, 𝐵, 𝐴, 𝐶, 𝐹, 𝐺}. But rather than doing all the computations from scratch, we can take the existing bucket tree created from the previous computation and reorient the edges to make 𝐷 the root of the tree. Reorienting the tree so that 𝐷 is the root, requires reversing only 2 edges, (𝐵, 𝐷) and (𝐴, 𝐵), suggesting that we only need to recompute messages from the node 𝐴 to 𝐵 and from 𝐵 to 𝐷. By definition, we can compute the belief in 𝐷 by the expression

𝐏(𝐷|𝐄=𝐞) = 𝛴_𝑎∊𝐴𝛴_𝑏∊𝐵 [ 𝐏(𝐴=𝑎) · 𝐏(𝐵=𝑏|𝐴=𝑎) · 𝐏(𝐷=𝑑|𝐴=𝑎,𝐵=𝑏) · 𝜆_𝐶→𝐵(𝑏) ] / 𝐏(𝐄=𝐞)

This computation can be carried also over the bucket tree, whose downward messages were already passed, in three steps:

the first executed in 𝑏𝑢𝑐𝑘𝑒𝑡_𝐴, where the function 𝐏(𝐴) is moved to 𝑏𝑢𝑐𝑘𝑒𝑡_𝐵

𝜋_𝐴→𝐵(𝑎) = 𝐏(𝐴)

the second is executed by 𝑏𝑢𝑐𝑘𝑒𝑡_𝐵, computing a function (a product) that is moved to 𝑏𝑢𝑐𝑘𝑒𝑡_𝐷

𝜋_𝐵→𝐷(𝑎, 𝑏) = 𝐏(𝑏|𝑎) · 𝜋_𝐴→𝐵(𝑎) · 𝜆_𝐶→𝐵(𝑎, 𝑏)

the final computation is carried in 𝑏𝑢𝑐𝑘𝑒𝑡_𝐷. The belief 𝐏(𝐷 | 𝐄=𝐞) can be computed in 𝑏𝑢𝑐𝑘𝑒𝑡_𝐷

𝐏(𝐷|𝐄=𝐞) = 𝛴_𝑎∊𝐴𝛴_𝑏∊𝐵 [ 𝐏(𝐷=𝑑|𝐴=𝑎,𝐵=𝑏) · 𝜋_𝐵→𝐷(𝑎, 𝑏) ] / 𝐏(𝐄=𝐞)

Compute Belief in ALL Variables

The example above can be generalized. We can compute the belief for every variable in 2 message passes along the bucket tree:

executing 𝐵𝐸-𝑏𝑒𝑙 with a variable as root

do second message passing from the root to the leaves along the bucket tree

Then at termination, the belief for each variable can be computed locally, in each bucket, consulting only the functions in its own bucket

Given an ordering of the variables, the first step of the algorithm generates the bucket tree by partitioning the functions into buckets and connecting the buckets into a tree. The subsequent top-down phase is identical to general bucket elimination. The bottom-up messages are defined as follows. The messages sent from the root up to the leaves will be denoted by 𝜋. The message from 𝐵_𝑗 to a child 𝐵_𝑖 is generated by multiplying the bucket’s function 𝑗 by all the 𝜋 message from its parent bucket and all the 𝜆 messages from its other child buckets and marginalizing (e.g. summing) over the eliminator from 𝐵_𝑗 to 𝐵_𝑖

we see that

downward messages (normal 𝐵𝐸 algorithm) are generated by eliminating a single variable

upward messages may be generated by eliminating 0, 1, or more variables

when 𝐵𝑇𝐸 terminates, each output bucket 𝐵_𝑖 contains:

𝜋 message it received from parent 𝐵_𝑗

its own function 𝐹_𝑖

𝜆 messages it received from each child 𝐵_𝑘

Definitions

bucket tree - the bucket tree has the buckets denoted {𝐵₁, …, 𝐵_𝑛} as its nodes. Each bucket contains a set of functions and a set of variables. The functions are those placed in the bucket according to the bucket partitioning rule where 𝜓_𝑖 is their product. The set of variables in 𝐵_𝑖, denoted 𝑠𝑐𝑜𝑝𝑒(𝐵_𝑖), is 𝑋_𝑖 and all its parents in the induced-graph (𝐆*, 𝑑). Each vertex 𝐵_𝑖 points to 𝐵_𝑗 (or, 𝐵_𝑗 is the parent of 𝐵_𝑖) if 𝑋_𝑗 is the closest neighbor of 𝑋_𝑖that appear before it in (𝐆*, 𝑑)

scope - If 𝐵_𝑗 is the parent of 𝐵_𝑖 in the bucket tree, then the separator of 𝑋_𝑖 and 𝑋_𝑗, 𝑠𝑒𝑝(𝐵_𝑖, 𝐵_𝑗) = 𝑠𝑐𝑜𝑝𝑒(𝐵_𝑖) ∩ 𝑠𝑐𝑜𝑝𝑒(𝐵_𝑗)

eliminator - Given a directed edge (𝐵_𝑖, 𝐵_𝑗) in the bucket tree, 𝑒𝑙𝑖𝑚(𝑖, 𝑗) is the set of variables in 𝐵_𝑖 and not in 𝐵_𝑗, namely 𝑒𝑙𝑖𝑚(𝐵_𝑖, 𝐵_𝑗) = 𝑠𝑐𝑜𝑝𝑒(𝐵_𝑖) - 𝑠𝑒𝑝(𝐵_𝑖, 𝐵_𝑗). We will call this set “the eliminator from 𝐵_𝑖 to 𝐵_𝑗“

Example BTE

the image below shows the messages passed:

𝜆 are messages passed in 𝐵𝐸-𝑏𝑒𝑙

𝜋 are messages passed in step 2

the 𝜋 functions in the bottom-up phase are computed as follows:

𝜋_𝐴→𝐵(𝑎) = 𝐏(𝐴)

𝜋_𝐵→𝐶(𝑐, 𝑎) = 𝐏(𝑏|𝑎) · 𝜋_𝐴→𝐵(𝑎) · 𝜆_𝐷→𝐵(𝑎, 𝑏)

𝜋_𝐵→𝐷(𝑎, 𝑏) = 𝐏(𝑏|𝑎) · 𝜋_𝐴→𝐵(𝑎) · 𝜆_𝐶→𝐵(𝑎, 𝑏)

𝜋_𝐶→𝐹(𝑎, 𝑏) = 𝛴_𝑎∊𝐴𝐏(𝑐|𝑎) · 𝜋_𝐵→𝐶(𝑐, 𝑎) · 𝜆_𝐹→𝐶(𝑏, 𝑐)

the image below shows the resulting output bucket tree

BTE - Algorithm (Explanation 1)

Click here to expand...

Construct a tree decomposition 𝑇

Initialize the tree decomposition as in bucket elimination

Select an arbitrary node of 𝑇 as root

Pass messages from leaves to root (upward pass)

Pass messages from root to leaves (downward pass)

Pseudocode

BTE - Algorithm (Explanation 2)

Click here to expand...

input

graphical model 𝓜 = ⟨𝐗,𝐃,𝐒,𝐅,𝐂⟩

ordering 𝑑

evidence 𝐄=𝐞

an ordering 𝑑 and a corresponding bucket-tree structure, in which for each node 𝑋_𝑖, its bucket 𝐵_𝑖 and its neighboring buckets are well defined

output

assume functions 𝐅 have been instantiated with evidence 𝐄=𝐞

for bucket 𝐵_𝑖 do:
----for each neighbor bucket 𝐵_𝑗 do:
--------once all messages from all other neighbors were received, do
------------compute and send to 𝐵𝑗 the message (𝜆_𝑖→𝑗):
----------------𝜆_𝑖→𝑗 ⟸ 𝛴_{𝑒𝑙𝑖𝑚(𝑖,𝑗)} [𝜓_𝑖·(𝜋_𝑘≠𝑗𝜆_𝑘→𝑖)]
----output: augmented buckets 𝐵‘₁, …, 𝐵’_𝑛, where each 𝐵’_𝑖 contains the original bucket functions and the 𝜆 messages it received

BTE - Complexity

Click here to expand...

Given a Graphical Model ℳ = ⟨𝐗, 𝐃, 𝐒, 𝐅, 𝐂⟩ whose primal/moral graph is 𝐆, let:

𝑤(𝑑) be its induced width of 𝐆 along ordering 𝑑

𝑘 the maximum domain size of 𝐷_𝑖 in 𝐃

𝑟 be the number of functions in 𝐅

𝑛 be the number of variables in 𝐗

𝑑𝑒𝑔 is the maximum degree of a node in the bucket tree

time complexity is 𝑂(𝑟 · 𝑑𝑒𝑔 · 𝑘^𝑤(𝑑)+1)
space complexity is 𝑂(𝑛 · 𝑘^𝑤(𝑑))

Complexity Given Evidence

the complexity can be further simplified by including observed/evidence variables into the picture

𝑤(𝑑) is replaced with 𝑤_𝐄(𝑑), where 𝑤_𝐄(𝑑) is the conditional induced width of the ordered graph 𝐆 along 𝑑, conditioned on the evidence 𝐄.

this statement is always true: 𝑤_𝐄(𝑑) ≤ 𝑤(𝑑)

time complexity is 𝑂(𝑟 · 𝑑𝑒𝑔 · 𝑘^{𝑤_𝐄(𝑑)+1})
space complexity is 𝑂(𝑛 · 𝑘^{𝑤_𝐄(𝑑)})

／var／log marcus chiu

Explorer

Probabilistic Inference - Bucket Tree Elimination (BTE) - Collect & Distribute Algorithm

From Bucket Elimination to Bucket Tree Elimination

Compute Belief in Variable 𝐴

Compute Belief in Variable 𝐷

Compute Belief in ALL Variables

Definitions

Example BTE

BTE - Algorithm (Explanation 1)

Pseudocode

BTE - Algorithm (Explanation 2)

input

output

BTE - Complexity

Complexity Given Evidence

／var／logmarcus chiu

Explorer

Probabilistic Inference - Bucket Tree Elimination (BTE) - Collect & Distribute Algorithm

From Bucket Elimination to Bucket Tree Elimination

Compute Belief in Variable 𝐴

Compute Belief in Variable 𝐷

Compute Belief in ALL Variables

Definitions

Example BTE

BTE - Algorithm (Explanation 1)

Pseudocode

BTE - Algorithm (Explanation 2)

input

output

BTE - Complexity

Complexity Given Evidence

／var／log marcus chiu