probabilistic inference recap
Link to originalgiven a probabilistic model, Probabilistic Inference is answering a probabilistic query over that model. It is also used for the computation/estimation of: distributions, the distribution’s parameters (if it’s a parametric distribution), the distribution’s probabilities, and/or the distribution’s characteristics
Click here to expand...
Link to originalLink to originalA graphical model 𝒢 is a tuple 𝒢 = ⟨𝐗, 𝐃, 𝐒, 𝐅, 𝐂⟩ where:
- 𝐗 = {𝑋1, …, 𝑋𝑛} set of ordered variables
- 𝐃 = {𝐷1, …, 𝐷𝑛} set of corresponding domains of each variable 𝑋𝑖 (e.g. if 𝑋1is a boolean variable then 𝐷1= {true, false}). The size of each 𝐷𝑖corresponds to the cardinality of variable 𝑋𝑖
- 𝐒 = {𝑆1, …, 𝑆𝑚} set of variable scopes, where each variable scope 𝑆𝑖is a subset of 𝐗 (i.e. 𝑆𝑖 ⊆ 𝐗)
- 𝐅 = {𝐹1, …, 𝐹𝑚} set of factors/functions, where each factor/function 𝐹𝑖 is defined over its corresponding variable scope 𝑆𝑖and maps any assignment over its scope to a real value
- in context of Bayesian Networks, 𝐅 = set of conditional probability tables
- in context of Markov Networks, 𝐅 = set of factors
- global function - is a function whose scope includes all variables (i.e. 𝑆𝑖 = 𝐗)
- local functions - is a function whose scope is a proper subset of variables (i.e. 𝑆𝑖⊂ 𝐗)
- 𝐂 is a set of combination operators which defines how functions are combined. common combination operators are:
- summation operator (𝛴)
- multiplication operator (𝛱)
- AND operator (∧) - for Boolean functions
- relational join operator (⨝) - when the functions are relation
- marginalization operator - for reasoning queries
- max operator - e.g. = argmax𝑦[ 𝐹𝑖(𝑥,𝑦) ] = 𝐹𝑗(𝑥) where 𝐹𝑗is a new function with scope over variable 𝑥
the set of local functions can be combined in a variety of ways (e.g. combination operators) to generate a new local function or even a global function
and let:
- 𝐐 = 𝑄1, …, 𝑄𝑟be the set of query variables
- 𝐄 = 𝐸1, …, 𝐸𝑠be the set of evidence variables
- 𝐇 = 𝐻1, …, 𝐻𝑡be the remainder of the variables(called the hidden variables (non-evidence, non-query variables))
- 𝐪 = 𝑞1, …, 𝑞𝑟be a grounded instantiation of 𝐐
- 𝐞 = 𝑒1, …, 𝑒𝑠be a grounded instantiation of 𝐄
- 𝐡 = 𝘩1, …, 𝘩𝑡be a grounded instantiation of 𝐇
𝑋, 𝐸, and 𝑌 are disjoint exhaustive sets of all variables 𝐗
therefore:
- the complete set of variables is 𝐗 = 𝐐 ∪ 𝐄 ∪ 𝐇 = {𝑋1, …, 𝑋𝑛} = {𝑄1, …, 𝑄𝑟, 𝐸1, …, 𝐸𝑠, 𝐻1, …, 𝐻𝑡}
- the complete set of instantiations is 𝒙 = 𝐪 ∪ 𝐞 ∪ 𝐡 = {𝑥1, …, 𝑥𝑛} = {𝑞1, …, 𝑞𝑟, 𝑒1, …, 𝑒𝑠, ℎ1, …, ℎ𝑡}
Click here to expand...
Link to original
- Parametric Probability Distribution Models - function/model based
- Non-Parametric Probability Distribution Models - instance-based
see also: ML - Parametric vs Non-Parametric
representations of global structures
representations of local structures
- representations of normalized distributions:
- representations of un-normalized distributions:
Click here to expand...
Link to original
- product comes from a joint probability 𝐏(𝐐=𝐪, 𝐇=𝐡, 𝐄=𝐞) being factorized into a product of conditional probabilities
- when we take the log of products it becomes sum of factors
Query
Product
SumMarginalization𝛴
Maximization𝑎𝑟𝑔𝑚𝑎𝑥
Description
Posterior Conditional Query𝐏(𝐐=𝐪|𝐄=𝐞) = 𝛴𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡, 𝐄=𝐞) ] / 𝐏(𝐄=𝐞)
Belief Updating Query
(a type of posterior conditional query)𝐏(𝑄𝑖=𝑞𝑖|𝐄=𝐞) = 𝛴𝐡∊𝐇[ 𝐏(𝑄𝑖=𝑞𝑖, 𝐇=𝐡, 𝐄=𝐞) ] / 𝐏(𝐄=𝐞)✔
✔
✘
- sum-product problem
- sum-log-sums problem
Prior Marginal Query𝐏(𝐐=𝐪) = 𝛴𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡) ]
Probability of Evidence Query𝐏(𝐄=𝐞) = 𝛴𝐡∊𝐇[ 𝐏(𝐇=𝐡, 𝐄=𝐞) ]
✔
✔
✘
- sum-product problem
- sum-log-sums problem
- both queries essentially the same where 𝐐 = 𝐄
- #P-Hard Problem
Maximum a Posterior (𝑴𝑨𝑷) Query𝑀𝐴𝑃(𝐐=?|𝐄=𝐞) = 𝑎𝑟𝑔𝑚𝑎𝑥𝐪[ 𝛴𝐡∊𝐇[ 𝐏(𝐐=𝐪, 𝐇=𝐡, 𝐄=𝐞) ] ]
where: 𝐐∪𝐄⊂𝐗 and 𝐐∪𝐇∪𝐄=𝐗✔
✔
✔
- max-sum-product problem
- max-sum-log-sums problem
- most probable assignment because they include maximization
- NPPP-Hard Problem
Most Probable Explanation (𝑴𝑷𝑬) Query𝑀𝑃𝐸(𝐐=?|𝐄=𝐞) = 𝑎𝑟𝑔𝑚𝑎𝑥𝐪[ 𝐏(𝐐=𝐪, 𝐄=𝐞) ]
where: 𝐐∪𝐄=𝐗✔
✘
✔
- max-product problem
- max-log-sums problem
- most probable assignment because they include maximization
- NP-Hard Problem
more details: Probabilistic Inference - Query/Task Types (Posterior Conditional - Prior Marginal / Probability of Evidence - MPE - MAP)
resources:
Click here to expand...
Bucket Tree Elimination (BTE) - Collect & Distribute Algorithm
- is an extension of a specific VE Algorithm called 𝐵𝐸-𝑏𝑒𝑙 algorithm, where a node is chosen as root, messages start at leaves passing it up to the root, and then messages are sent back down to leaves
- since BE/VE Algorithm induces a cluster tree on which BE/VE Algorithm is performed, BTE is also a type of belief propagation algorithm
The 𝐵𝐸-𝑏𝑒𝑙 algorithm is designed to compute the probability of 𝐏(𝑄𝑖|𝐄=𝑒), in other words, the belief of 𝑄𝑖. It is often desirable to also compute the belief query for each and every variable in the network: 𝐏(𝑄1|𝐄=𝑒), …, 𝐏(𝑄𝑖-1|𝐄=𝑒), 𝐏(𝑄𝑖+1|𝐄=𝑒), …, 𝐏(𝑄𝑛|𝐄=𝑒). A brute-force approach will require running 𝐵𝐸-𝑏𝑒𝑙 algorithm 𝑛 times, one for each variable. We will show next that this is unnecessary. By viewing bucket-elimination as a message passing algorithm along a rooted bucket tree, we can augment it with a second message passing phase in the opposite direction, from root to leaves, and thus every belief query would be computed
From Bucket Elimination to Bucket Tree Elimination
Click here to expand...
consider the Bayesian Network below
Compute Belief in Variable 𝐴
we would like to compute the probabilistic query 𝐏(𝐴|𝐄=𝐞) where:
- 𝐴 is a single variable denoting the season
- 𝐄=𝐞 is a set of observed variables assigned with values. we call 𝐄=𝐞 evidence
Figure a shows the process of 𝐵𝐸-𝑏𝑒𝑙 with:
- buckets with ordering 𝑑 = {𝐴, 𝐵, 𝐶, 𝐷, 𝐹, 𝐺}
- 𝜆 messages passed by 𝐵𝐸 from top to bottom
Figure b depicts the same computation as message-passing along a tree which we will refer to as a bucket tree
at the termination of 𝐵𝐸-𝑏𝑒𝑙 both 𝐏(𝐴, 𝐄=𝐞) and 𝐏(𝐄=𝐞) are computed, thus belief in 𝐴 given 𝐄=𝐞 is known via Bayes Rule. 𝐏(𝐴 | 𝐄=𝐞) = 𝐏(𝐴, 𝐄=𝐞) / 𝐏(𝐄=𝐞)
Compute Belief in Variable 𝐷
Now, what if we want to compute 𝐏(𝐷 | 𝐄=𝐞)? We could restart the algorithm using some new arbitrary ordering whose first variable is 𝐷, such as {𝐷, 𝐵, 𝐴, 𝐶, 𝐹, 𝐺}. But rather than doing all the computations from scratch, we can take the existing bucket tree created from the previous computation and reorient the edges to make 𝐷 the root of the tree. Reorienting the tree so that 𝐷 is the root, requires reversing only 2 edges, (𝐵, 𝐷) and (𝐴, 𝐵), suggesting that we only need to recompute messages from the node 𝐴 to 𝐵 and from 𝐵 to 𝐷. By definition, we can compute the belief in 𝐷 by the expression
- 𝐏(𝐷|𝐄=𝐞) = 𝛴𝑎∊𝐴𝛴𝑏∊𝐵 [ 𝐏(𝐴=𝑎) · 𝐏(𝐵=𝑏|𝐴=𝑎) · 𝐏(𝐷=𝑑|𝐴=𝑎,𝐵=𝑏) · 𝜆𝐶→𝐵(𝑏) ] / 𝐏(𝐄=𝐞)
This computation can be carried also over the bucket tree, whose downward messages were already passed, in three steps:
- the first executed in 𝑏𝑢𝑐𝑘𝑒𝑡𝐴, where the function 𝐏(𝐴) is moved to 𝑏𝑢𝑐𝑘𝑒𝑡𝐵
- 𝜋𝐴→𝐵(𝑎) = 𝐏(𝐴)
- the second is executed by 𝑏𝑢𝑐𝑘𝑒𝑡𝐵, computing a function (a product) that is moved to 𝑏𝑢𝑐𝑘𝑒𝑡𝐷
- 𝜋𝐵→𝐷(𝑎, 𝑏) = 𝐏(𝑏|𝑎) · 𝜋𝐴→𝐵(𝑎) · 𝜆𝐶→𝐵(𝑎, 𝑏)
- the final computation is carried in 𝑏𝑢𝑐𝑘𝑒𝑡𝐷. The belief 𝐏(𝐷 | 𝐄=𝐞) can be computed in 𝑏𝑢𝑐𝑘𝑒𝑡𝐷
- 𝐏(𝐷|𝐄=𝐞) = 𝛴𝑎∊𝐴𝛴𝑏∊𝐵 [ 𝐏(𝐷=𝑑|𝐴=𝑎,𝐵=𝑏) · 𝜋𝐵→𝐷(𝑎, 𝑏) ] / 𝐏(𝐄=𝐞)
Compute Belief in ALL Variables
The example above can be generalized. We can compute the belief for every variable in 2 message passes along the bucket tree:
- executing 𝐵𝐸-𝑏𝑒𝑙 with a variable as root
- do second message passing from the root to the leaves along the bucket tree
Then at termination, the belief for each variable can be computed locally, in each bucket, consulting only the functions in its own bucket
Given an ordering of the variables, the first step of the algorithm generates the bucket tree by partitioning the functions into buckets and connecting the buckets into a tree. The subsequent top-down phase is identical to general bucket elimination. The bottom-up messages are defined as follows. The messages sent from the root up to the leaves will be denoted by 𝜋. The message from 𝐵𝑗 to a child 𝐵𝑖 is generated by multiplying the bucket’s function 𝑗 by all the 𝜋 message from its parent bucket and all the 𝜆 messages from its other child buckets and marginalizing (e.g. summing) over the eliminator from 𝐵𝑗 to 𝐵𝑖
we see that
- downward messages (normal 𝐵𝐸 algorithm) are generated by eliminating a single variable
- upward messages may be generated by eliminating 0, 1, or more variables
when 𝐵𝑇𝐸 terminates, each output bucket 𝐵𝑖 contains:
- 𝜋 message it received from parent 𝐵𝑗
- its own function 𝐹𝑖
- 𝜆 messages it received from each child 𝐵𝑘
Definitions
- bucket tree - the bucket tree has the buckets denoted {𝐵1, …, 𝐵𝑛} as its nodes. Each bucket contains a set of functions and a set of variables. The functions are those placed in the bucket according to the bucket partitioning rule where 𝜓𝑖 is their product. The set of variables in 𝐵𝑖, denoted 𝑠𝑐𝑜𝑝𝑒(𝐵𝑖), is 𝑋𝑖 and all its parents in the induced-graph (𝐆*, 𝑑). Each vertex 𝐵𝑖 points to 𝐵𝑗 (or, 𝐵𝑗 is the parent of 𝐵𝑖) if 𝑋𝑗 is the closest neighbor of 𝑋𝑖that appear before it in (𝐆*, 𝑑)
- scope - If 𝐵𝑗 is the parent of 𝐵𝑖 in the bucket tree, then the separator of 𝑋𝑖 and 𝑋𝑗, 𝑠𝑒𝑝(𝐵𝑖, 𝐵𝑗) = 𝑠𝑐𝑜𝑝𝑒(𝐵𝑖) ∩ 𝑠𝑐𝑜𝑝𝑒(𝐵𝑗)
- eliminator - Given a directed edge (𝐵𝑖, 𝐵𝑗) in the bucket tree, 𝑒𝑙𝑖𝑚(𝑖, 𝑗) is the set of variables in 𝐵𝑖 and not in 𝐵𝑗, namely 𝑒𝑙𝑖𝑚(𝐵𝑖, 𝐵𝑗) = 𝑠𝑐𝑜𝑝𝑒(𝐵𝑖) - 𝑠𝑒𝑝(𝐵𝑖, 𝐵𝑗). We will call this set “the eliminator from 𝐵𝑖 to 𝐵𝑗“
Example BTE
the image below shows the messages passed:
- 𝜆 are messages passed in 𝐵𝐸-𝑏𝑒𝑙
- 𝜋 are messages passed in step 2
the 𝜋 functions in the bottom-up phase are computed as follows:
- 𝜋𝐴→𝐵(𝑎) = 𝐏(𝐴)
- 𝜋𝐵→𝐶(𝑐, 𝑎) = 𝐏(𝑏|𝑎) · 𝜋𝐴→𝐵(𝑎) · 𝜆𝐷→𝐵(𝑎, 𝑏)
- 𝜋𝐵→𝐷(𝑎, 𝑏) = 𝐏(𝑏|𝑎) · 𝜋𝐴→𝐵(𝑎) · 𝜆𝐶→𝐵(𝑎, 𝑏)
- 𝜋𝐶→𝐹(𝑎, 𝑏) = 𝛴𝑎∊𝐴𝐏(𝑐|𝑎) · 𝜋𝐵→𝐶(𝑐, 𝑎) · 𝜆𝐹→𝐶(𝑏, 𝑐)
the image below shows the resulting output bucket tree
BTE - Algorithm (Explanation 1)
Click here to expand...
- Construct a tree decomposition 𝑇
- Initialize the tree decomposition as in bucket elimination
- Select an arbitrary node of 𝑇 as root
- Pass messages from leaves to root (upward pass)
- Pass messages from root to leaves (downward pass)
Pseudocode
BTE - Algorithm (Explanation 2)
Click here to expand...
input
- graphical model 𝓜 = ⟨𝐗,𝐃,𝐒,𝐅,𝐂⟩
- ordering 𝑑
- evidence 𝐄=𝐞
an ordering 𝑑 and a corresponding bucket-tree structure, in which for each node 𝑋𝑖, its bucket 𝐵𝑖 and its neighboring buckets are well defined
output
assume functions 𝐅 have been instantiated with evidence 𝐄=𝐞
for bucket 𝐵𝑖 do:
----for each neighbor bucket 𝐵𝑗 do:
--------once all messages from all other neighbors were received, do
------------compute and send to 𝐵𝑗 the message (𝜆𝑖→𝑗):
----------------𝜆𝑖→𝑗 ⟸ 𝛴𝑒𝑙𝑖𝑚(𝑖,𝑗) [𝜓𝑖·(𝜋𝑘≠𝑗𝜆𝑘→𝑖)]
----output: augmented buckets 𝐵‘1, …, 𝐵’𝑛, where each 𝐵’𝑖 contains the original bucket functions and the 𝜆 messages it received
BTE - Complexity
Click here to expand...
Given a Graphical Model ℳ = ⟨𝐗, 𝐃, 𝐒, 𝐅, 𝐂⟩ whose primal/moral graph is 𝐆, let:
- 𝑤(𝑑) be its induced width of 𝐆 along ordering 𝑑
- 𝑘 the maximum domain size of 𝐷𝑖 in 𝐃
- 𝑟 be the number of functions in 𝐅
- 𝑛 be the number of variables in 𝐗
- 𝑑𝑒𝑔 is the maximum degree of a node in the bucket tree
time complexity is 𝑂(𝑟 · 𝑑𝑒𝑔 · 𝑘𝑤(𝑑)+1)
space complexity is 𝑂(𝑛 · 𝑘𝑤(𝑑))Complexity Given Evidence
the complexity can be further simplified by including observed/evidence variables into the picture
𝑤(𝑑) is replaced with 𝑤𝐄(𝑑), where 𝑤𝐄(𝑑) is the conditional induced width of the ordered graph 𝐆 along 𝑑, conditioned on the evidence 𝐄.
this statement is always true: 𝑤𝐄(𝑑) ≤ 𝑤(𝑑)
time complexity is 𝑂(𝑟 · 𝑑𝑒𝑔 · 𝑘𝑤𝐄(𝑑)+1)
space complexity is 𝑂(𝑛 · 𝑘𝑤𝐄(𝑑))
---collect--and--distribute-algorithm/sample-bayesian-network.png)
---collect--and--distribute-algorithm/bucket-elimination-example.png)
---collect--and--distribute-algorithm/bucket-tree-algorithm-propagation-of-messages.png)
---collect--and--distribute-algorithm/bucket-tree-algorithm-augmented-output-bucket-tree.png)
---collect--and--distribute-algorithm/bucket-tree-elimination-algorithm.png)