Sections

  • Representing Knowledge in an Uncertain Domain
  • The Semantics of Bayesian Networks
  • Efficient Representations of Conditional Distributions
  • Exact Inference of Bayesian Networks
  • Approximate Inference in Bayesian Networks
  • Relational and First-Order Probability Models
  • Other Approaches to Uncertain Reasoning

Representing Knowledge in an Uncertain Domain

A Bayesian Network is a Directed Acyclic Graph (DAG) in which each node is annotated with probability information. The full specification is as follows:

  1. each node corresponds to a random variable (discrete or continuous)
  2. a set of directed links or arrows connects pairs of nodes
    1. an arrow from node X to node Y, means X is a parent of Y
    2. the graph has no directed cycles (i.e. Directed Acyclic Graph (DAG))
  3. each node Xi has a conditional probability distribution P(Xi|Parents(Xi))
Example Bayesian Network

suppose you have a new burglar alarm installed at home. It is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes. You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm. John nearly always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too. Mary, on the other hand, likes rather loud music and often misses the alarm altogether. Given the evidence of who has or has not called, we would like to estimate the probability of a burglarybayesian network for this domain 

Figure 14.2 shows several examples of a Conditional Probability Table (CPT):

  • a CPT is used for representing a conditional probability distribution on a node variable when given 0 or more parent node values
  • each row in the CPT contains the conditional probability of a node value given a conditioning case (i.e. a conditioning case is just a possible combination of values for the parent nodes)

The Semantics of Bayesian Networks

There are 2 equivalent views in understanding the semantics of Bayesian networks:

Representing the Full Joint Distribution

where parents(Xi) denotes the values of Parents(Xi) that appear in x1, …, xn

To illustrate this, we can calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both John and Mary call. We multiply entries from the joint distribution

P(j,m,a,¬b,¬e) = P(j | a) P(m | a) P(a | ¬b ∧ ¬e) P(¬b) P(¬e)
P(j,m,a,¬b,¬e) = 0.90 × 0.70 × 0.001 × 0.999 × 0.998
P(j,m,a,¬b,¬e) = 0.000628

Constructing Bayesian Networks

First, we rewrite the entries in the joint distribution in terms of conditional probability, using the product rule

P(x1, …, xn) = P(xn| xn−1, …, x1) P(xn−1| xn−2, …, x1) ··· P(x2| x1) P(x1)P(x1, …, xn) = ∏1nP(xi| xi−1, …, x1)

this identity is called chain rule

Comparing it with Equation (14.2), we see that the specification of the joint distribution is equivalent to the general assertion that, for every variable Xiin the network,

P(Xi| Xi−1, …, X1) = P(Xi| Parents(Xi)), provided that Parents(Xi) ⊆ {Xi−1, …, X1

This last condition (Parents(Xi) ⊆ {Xi−1, …, X1}) is satisfied by numbering the nodes in a way that is consistent with the partial order implicit in the graph structure

What Equation (AI Chapter 14 - Probabilistic Reasoning) says is that the Bayesian network is a correct representation of the domain only if each node is conditionally independent of its other predecessors in the node ordering, given its parents. We can satisfy this condition with this methodology:

  1. Nodes: first determine the set of variables that are required to model the domain. then order them, {X1, …, Xn}. any order will work, but the resulting network will be more compact if the variables are ordered such that causes precede effects.
  2. Links: for i = 1 to n:
    1. for Xichoose a minimal set of parents from {X1, …, Xi−1}, such that AI Chapter 14 - Probabilistic ReasoningP(Xi| Xi−1, …, X1) = P(Xi| Parents(Xi)) is satisfied
    2. For each parent insert a link from the parent to Xi
    3. CPTs: Write down the conditional probability table, P(Xi|Parents(Xi))

Intuitively, the parents of node Xishould contain all those nodes in X1, …, Xi−1that directly influence Xi. For example, suppose we have completed the network in Figure 14.2 except for the choice of parents for MaryCalls. MaryCalls is certainly influenced by whether there is a Burglary or an Earthquake, but not directly influenced. Intuitively, our knowledge of the domain tells us that these events influence Mary’s calling behavior only through their effect on the alarm. Also, given the state of the alarm, whether John calls has no influence on Mary’s calling. Formally speaking, we believe that the following conditional independence statement holds:

P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)

Thus, Alarm will be the only parent node for MaryCalls

Compactness

a Bayesian network can often be far more compact than the full joint distribution

the compactness is an example of a general property of locally structured (also called sparse) systems. In a locally structured system, each subcomponent interacts directly with only a bounded number of other components, regardless of the total number of components. Local structure is usually associated with linear rather than exponential growth in complexity. In the case of Bayesian networks, it is reasonable to suppose that in most domains each random variable is directly influenced by at most k others, for some constant k. If we assume n Boolean variables for simplicity, then the amount of information needed to specify each conditional probability table will be at most 2knumbers, and the complete network can be specified by n2knumbers. In contrast, the joint distribution contains 2nnumbers. To make this concrete, suppose we have n = 30 nodes, each with five parents (k = 5). Then the Bayesian network requires 960 numbers, but the full joint distribution requires over a billion

Node Ordering

Even in a locally structured domain, we will get a compact Bayesian network only if we choose the node ordering well. What happens if we happen to choose the wrong order? Consider the burglary example again. Suppose we decide to add the nodes in the order [MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]. We then get the somewhat more complicated network shown in Figure 14.3(a). The process goes as follows:

  • Adding MaryCalls: No parents.
  • Adding JohnCalls: If Mary calls, that probably means the alarm has gone off, which of course would make it more likely that John calls. Therefore, JohnCalls needs MaryCalls as a parent.
  • Adding Alarm: Clearly, if both call, it is more likely that the alarm has gone off than if just one or neither calls, so we need both MaryCalls and JohnCalls as parents.
  • Adding Burglary: If we know the alarm state, then the call from John or Mary might give us information about our phone ringing or Mary’s music, but not about burglary: P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary | Alarm) Hence we need just Alarm as parent.
  • Adding Earthquake: If the alarm is on, it is more likely that there has been an earthquake. (The alarm is an earthquake detector of sorts.) But if we know that there has been a burglary, then that explains the alarm, and the probability of an earthquake would be only slightly above normal. Hence, we need both Alarm and Burglary as parents

The resulting network has 3 more links than the original network in AI Chapter 14 - Probabilistic Reasoning and requires three more probabilities to be specified

[MaryCalls, JohnCalls, Earthquake, Burglary, Alarm] - is a bad node ordering as shown in Figure 14.3(b)
[EarthquakeBurglary, Alarm, MaryCalls, JohnCalls] - is a good node ordering as shown in AI Chapter 14 - Probabilistic Reasoning

Conditional Independence Relations in Bayesian Networks

AI Chapter 14 - Probabilistic Reasoning provides a “numerical” semantics for Bayesian Networks in terms of the representation of the full joint distribution

we can also go in the other direction. We can start from a “topological” semantics that specifies the conditional independence relationships encoded by the graph structure, and from this we can derive the “numerical” semantics

topological semantics
  • specifies that each variable is conditionally independent of its non-descendants, given its parents
  • for example, inFigure 14.2, JohnCalls is independent of Burglary, Earthquake, and MaryCalls given the value of Alarm
  • this property is illustrated in Figure 14.4(a)
implied by topological semantics
  • a node is conditionally independent of all other nodes in the network, given its parents, children, and children’s parents—that is, given its Markov blanket.
  • for example, Burglary is independent of JohnCalls and MaryCalls, given Alarm and Earthquake
  • this property is illustrated in Figure 14.4(b)

Efficient Representations of Conditional Distributions

if the maximum number of parents = k, filling in the CPT for a node requires up to O(2k) numbers Usually the relationship between parents and the child can be describable by a canonical distribution that fits some standard pattern. In such cases, the complete CPT table can be specified by naming the pattern and perhaps supplying a few parameters—much easier than supplying an exponential number of parameters

Deterministic Nodes

  • a simple example of canonical distribution
  • a deterministic node has its value specified exactly by the values of its parents, with no uncertainty
  • The relationship can be a logical one: for example, the relationship between the parent nodes Canadian, US, Mexican and the child node NorthAmerican is simply that the child is the disjunction of the parents
  • The relationship can also be numerical: for example, if the parent nodes are the prices of a particular model of car at several dealers and the child node is the price that a bargain hunter ends up paying, then the child node is the minimum of the parent values; or if the parent nodes are a lake’s inflows (rivers, runoff, precipitation) and outflows (rivers, evaporation, seepage) and the child is the change in the water level of the lake, then the value of the child is the sum of the inflow parents minus the sum of the outflow parents

Noisy-OR Relation

  • is a generalization of the logical OR. In propositional logic, we might say that Fever is true if and only if Cold, Flu, or Malaria is true. The noisy-OR model allows for uncertainty about the ability of each parent to cause the child to be true—the causal relationship between parent and child may be inhibited, and so a patient could have a cold, but not exhibit a fever

  • First, it assumes that all the possible causes are listed. (If some are missing, we can always add a so-called leak node that covers “miscellaneous causes.”) Second, it assumes that inhibition of each parent is independent of inhibition of any other parents: for example, whatever inhibits Malaria from causing a fever is independent of whatever inhibits Flu from causing a fever. Given these assumptions, Fever is false if and only if all its true parents are inhibited, and the probability of this is the product of the inhibition probabilities q for each parent

  • In general, noisy logical relationships in which a variable depends on k parents can be described using O(k) parameters instead of O(2k)for the full conditional probability table (as shown in the following example)

  • Let us suppose these individual inhibition probabilities are as follows:
    qcold= P(¬fever | cold, ¬flu, ¬malaria) = 0.6
    qflu= P(¬fever | ¬cold, flu, ¬malaria) = 0.2
    qmalaria= P(¬fever | ¬cold, ¬flu, malaria) = 0.1

  • Then, from this information and the noisy-OR assumptions, the entire CPT can be built. The general rule is that

    List indent undo

  • where the product is taken over the parents that are set to true for that row of the CPT. The following table illustrates this calculation

    List indent undo

Bayesian Nets With Continuous Variables

Methods in Dealing With Continuous Values
  • discretization - dividing up the possible values into a fixed set of intervals (e.g. temperatures could be divided into (<0oC), (0oC−100oC), and (>100oC))
  • probability density functions such as Gaussian (or normal) distribution N(μ,σ2)(x) has the mean μ and the variance σ2 as parameters
  • nonparametric representation - define the conditional distribution implicitly with a collection of instances, each containing specific values of the parent and child variables
Hybrid Bayesian Network
  • hybrid Bayesian network is a network with both discrete and continuous variables

  • To specify a hybrid network, we have to specify two new kinds of distributions:

    • the conditional distribution for a continuous variable given discrete or continuous parents
    • the conditional distribution for a discrete variable given continuous parents
  • Consider the simple example in Figure 14.5, in which a customer buys some fruit depending on its cost, which depends in turn on the size of the harvest and whether the government’s subsidy scheme is operating

    List indent undo

    • variable Cost is continuous and has continuous and discrete parents
    • variable Buys is discrete and has a continuous parent
  • For the Cost continuousvariable with 2 parents: discrete variable Subsidy and continuous variable Harvest, we need to specify P(Cost | Harvest, Subsidy)

    • the discrete parent Subsidy is handled by enumeration—that is, by specifying both P(Cost | Harvest, subsidy) and P(Cost | Harvest , ¬subsidy)
    • the continuous parent Harvest is handled by specifying how the distribution over the cost c depends on the continuous value h of Harvest the most common choice is the linear Gaussian distribution, in which the child has a Gaussian distribution whose mean μ varies linearly with the value of the parent and whose standard deviation σ is fixed. We need two distributions: one for subsidy and one for ¬subsidy with different parameters:

      List indent undo

  the conditional distribution for Cost is specified by naming the linear Gaussian distribution and providing the parameters at, bt, σt, af , bf , and σf . Figures 14.6(a) and (b) show these two relationships

List indent undo

  Notice that in each case the slope is negative, because cost decreases as supply increases. (Of course, the assumption of linearity implies that the cost becomes negative at some point; the linear model is reasonable only if the harvest size is limited to a narrow range.) Figure 14.6(c) shows the distribution P(c|h), averaging over the two possible values of Subsidy and assuming that each has prior probability 0.5
  • For the discrete variable Buys with a continuous variable parent Cost
    • It seems reasonable to assume that the customer will buy if the cost is low and will not buy if it is high and that the probability of buying varies smoothly in some intermediate region. In other words, the conditional distribution is like a “soft” threshold function
    • 2 methods for this “soft” threshold function:
      • probit distribution - Figure 14.7(a)
        • uses the standard normal distribution to produce a soft threshold:

          List indent undo

        • Then the probability of Buys given Cost might be

          List indent undo

        • which means that the cost threshold occurs around μ, the width of the threshold region is proportional to σ, and the probability of buying decreases as cost increases
      • logit distribution - Figure 14.7(b)

    List indent undo

Exact Inference of Bayesian Networks

the basic task for any probabilistic inference system is to compute the posterior probability distribution for a set of query variables, given some observed event—that is, some assignment of values to a set of evidence variables.

In this section we discuss exact algorithms for computing posterior probabilities and will consider the complexity of this task. It turns out that the general case is intractable, so section AI Chapter 14 - Probabilistic Reasoning covers methods for approximate inference

Exact Algorithms

Computing Posterior Probability Distribution - Simple Single Query Variable

to simplify the presentation, we will consider only one query variable at a time:

  • X denotes the query variable
  • E denotes the set of evidence variables E1, …, Em
  • e is a particular observed event
  • Y will denotes the non-evidence, non-query variables Y1, …, Yl(called the hidden variables)

thus, the complete set of variables is X = {X} ∪ E Ya typical query asks for the posterior probability distribution P(X|e)

In the burglary network shown below, we might observe the event in which JohnCalls=true and MaryCalls=true. We could then ask for, say, the probability that a burglary has occurred:

P(Burglary | JohnCalls=true, MaryCalls=true) = ⟨0.284, 0.716⟩

Inference by Enumeration

a query P(X|e) is a conditional probability and as Chapter 13 has explained general inference procedure showing that conditional probabilities can be computed by summing terms from the full joint distribution:

Equation 13.9 - P(X|e) = α P(X,e) = α Σy P(X,e,y), where y is a particular un-observed event (y is a grounded instantiation of non-evidence, non-query variables Y

Now, a Bayesian network gives a complete representation of the full joint distribution. More specifically, AI Chapter 14 - Probabilistic Reasoning shows that the terms P(x, e, y) in the joint distribution can be written as products of conditional probabilities from the network. Therefore, a posterior conditional query can be answered using a Bayesian network by computing sums of products of conditional probabilities from the network

For example, the query P(Burglary | JohnCalls=true, MaryCalls=true). The hidden variables for this query are Earthquake and Alarm. From AI Chapter 14 - Probabilistic Reasoning we have:

P(B | j,m) = α P(B,j,m) = α ΣeΣa P(B,j,m,e,a)

The semantics of Bayesian networks AI Chapter 14 - Probabilistic Reasoning then gives us an expression in terms of CPT entries. For simplicity we do this just for Burglary=true:

P(b | j,m) = α ΣeΣaP(b) P(e) P(a|b,e) P(j|a) P(m|a)

To compute this expression, we have to add four terms, each computed by multiplying five numbers. In the worst case, where we have to sum out almost all the variables, the complexity of the algorithm for a network with n Boolean variables is O(n2n)

An improvement can be obtained from the following simple observations: the P(b) term is a constant and can be moved outside the summations over a and e, and the P(e) term can be moved outside the summation over a. Hence, we have:

P(b | j,m) = α P(b) Σe P(e) ΣaP(a|b,e) P(j|a) P(m|a)

Using the numbers from AI Chapter 14 - Probabilistic Reasoning, we obtain:

P(b | j,m) = α × 0.00059224
P(¬b|j,m) = α × 0.0014919

hence:

P(B | j, m) = α ⟨0.00059224, 0.0014919⟩
P(B | j, m) ≈ ⟨0.284, 0.716⟩

The structure of P(b | j,m) computation is shown in Figure 14.8 below

The ENUMERATION-ASK algorithm in Figure 14.9 evaluates such trees using depth-first recursion. The algorithm is very similar in structure to the backtracking algorithm for solving CSPs (Figure 6.5) and the DPLL algorithm for satisfiability (Figure 7.17).

The space complexity of ENUMERATION-ASK is only linear in the number of variables: the algorithm sums over the full joint distribution without ever constructing it explicitly. Unfortunately, its time complexity for a network with n Boolean variables is always O(2n) better than the O(n 2n) for the simple approach described earlier, but still rather grim.

Note that the tree in Figure 14.8 makes explicit the repeated sub-expressions evaluated by the algorithm. The products P(j|a) P(m|a) and P(j|¬a) P(m|¬a) are computed twice, once for each value of e. The next section describes a general method that avoids such wasted computations

The Variable Elimination Algorithm

TODO page 543 of AI Textbook 3rd Edition

The Complexity of Exact Inference

the complexity of exact inference in Bayesian networks depends strongly on the structure of the network

singly connected networks or polytrees

  • the time and space complexity of exact inference in polytrees is linear in the size of the network
  • the size is defined as the number of CPT entries; if the number of parents of each node is bounded by a constant, then the complexity will also be linear in the number of nodes.

connected networks

  • variable elimination can have exponential time and space complexity in the worst case, even when the number of parents per node is bounded
  • e.g. see Figure 14.12(a)
Clustering Algorithms

The basic idea of clustering is to join individual nodes of the network to form cluster nodes in such a way that the resulting network is a polytree

Approximate Inference in Bayesian Networks

2 families of algorithms:

  • direct sampling
  • Markov chain sampling

TODO page 549 of AI Textbook 3rd Edition

Relational and First-Order Probability Models

Other Approaches to Uncertain Reasoning