Sections

  • Acting Under Uncertainty
  • Basic Probability Notation
  • Inference Using Full Joint Distributions
  • Independence
  • Bayes’ Rule and Its Use

Acting Under Uncertainty

trying to use classical logic (e.g. propositional and first-order logic) to cope with a domain like medical diagnosis fails for 3 main reasons:

  • laziness: it is too much work to list the complete set of antecedents or consequents needed to ensure an exception-less rule and too hard to use such rules
  • ignorance - lack of knowledge or information
    • theoretical ignorance: medical science has no complete theory for the domain
    • practical ignorance: even if we know all the rules, we might be uncertain about a particular patient because not all the necessary tests have been or can be run

classical logic has a qualification problem - the impossibility of listing all the preconditions required for a real-world action to have its intended effect

probability provides a way of summarizing the uncertainty that comes from our laziness and ignorance

classical logic statements are made with respect to the real world
probability statements are made with respect to a knowledge state, not to the real world

Uncertainty and Rational Decisions

Consider the A90plan for getting to the airport (leaving home 90 minutes before the flight departs and driving at a reasonable speed). Suppose it gives us a 97% chance of catching our flight. Does this mean it is a rational choice? Not necessarily: there might be other plans, such as A180, with higher probabilities. If it is vital not to miss the flight, then it is worth risking the longer wait at the airport. What about A1440, a plan that involves leaving home 24 hours in advance? In most circumstances, this is not a good choice, because although it almost guarantees getting there on time, it involves an intolerable wait—not to mention a possibly unpleasant diet of airport food.

To make such choices, an agent must have preferences between different outcomes (a completely specified state) of various plans.

Representing and Reasoning Preferences

utility theory

  • used to represent preferences and reasoning with it
  • utility means “the quality of being useful”
  • the utility of a outcome/state is relative to an agent
  • a utility function accounts for any set of preferences
Theory of Rational Decisions

decision theory

  • decision-theory probability-theory utility-theory
  • fundamental idea of decision theory is that an agent is rational if and only if it chooses an action of maximum expected utility (MEU)
  • maximum expected utility (MEU) - choosing an action out of a set of all possible actions in current state that yields the highest expected utility (an average over all the possible outcomes of the action)
  • a decision-theoretic agent’s belief state represents not just the possibilities for world states but also their probabilities

Where Do Probabilities Come From?

Basic Probability Notation

Probability Model

a probability model is defined by 3 components: its sample spaceevents within the sample space, and probabilities associated with each event

  • sample space is a set of all possible worlds
    • the sample space of possible worlds are:
      • mutually exclusive - two distinct possible worlds cannot both be a single case in the sample space
      • exhaustive - any one possible world must be a case in the sample space
    • Ω (uppercase omega) - sample space
    • ω (lowercase omega) - an element or elements of the space (possible worlds)
  • events is a subset of the sample space
    • an event is described as a propositional statement which is associated subset of all possible worlds
    • i.e. a propositional-event is a subset of all possible worlds where the proposition statement is held true
    • atomic event is an event with 1 possible world
      • is a complete specification of the state of the world about which the agent is uncertain
      • If the world is described by a set of random variables, an atomic event is a particular assignment of values to the random variables
  • probability is a numerical value assigned to a given event
    • the probability of an event is the summation of all P(ω) where the proposition holds true in ω
    • for all events φ: P(φ) = [ ΣP(ω) : ω∈φ ].where ω is a possible world satisfying φ

a probability model associates a numerical probability P(ω) with each possible world, while adhering to the axioms of probability theory:

  • 0 ≤ P(ω) ≤ 1 for every ω - every possible world has a probability between 0 and 1
  • ΣP(ω) = 1 - the total probability of the set of possible worlds is 1

probabilistic assertions and queries are not usually about particular possible worlds, but about sets of them

Unconditional and Conditional Probabilities

(unconditional or prior) probabilities - refer to degrees of belief in propositions in the absence of any other information or observed evidence(s)
(conditional or posteriorprobabilities - refer to degrees of belief in propositions in the presence of any other information or observed evidence(s)

relationship between prior and posterior probabilities see AI Chapter 13 - Quantifying Uncertainty

Conditional Implication vs Logical Implication

consider the following assertion: P(cavity | toothache) = 0.6

  • logical implication: “Whenever toothache is true, conclude that cavity is true with probability 0.6” wrong semantics of the given assertion
  • conditional implication: “Whenever toothache is true and we have no further relevant information, conclude that cavity is true with probability 0.6” correct semantics of the given assertion
    the extra condition is important: e.g. if we had the further relevant information that the dentist found no cavities, we definitely would not want to conclude that cavity is true with probability 0.6; instead we need to use P(cavity | toothache ∧ ¬cavity) = 0
Conditional Probability in terms of Unconditional Probabilities

posterior conditional probability is defined in terms of prior unconditional probabilities as follows

  • P(a | b) is a conditional probability
  • P(a ʌ b) and P(b) are prior unconditional probabilities
Product Rule

the formula above can be written in a different form called the product rule (also called chain rule when product rule is used more than once)

The Language of Propositions in Probability Assertions

propositions describing sets of possible worlds:

  • are written in a notation that combines elements of propositional logic and constraint satisfaction notation
  • it is a factored representation, in which a possible world is represented by a set of variable/value pairs
Random Variable & its Domain

every random variable has a domain—the set of possible values it can take on

domains can either be:

  • finite or infinite (e.g. boolean is finite, integers and reals are infinite)
  • discrete or continuous (e.g. integers are discrete, reals are continuous)
  • nominal or ordinal? not sure…

elementary propositions can be combined in such the following example:

  • The probability that the patient has a cavity, given that she is a teenager with no toothache, is 0.1

    P(cavity | ¬toothache ∧ teen ) = 0.1

Probability Distribution

probabilities of all the possible values of a random variable. We could write:

  • P(Weather = sunny) = 0.6
  • P(Weather = rain) = 0.1
  • P(Weather = cloudy) = 0.29
  • P(Weather = snow) = 0.01

but as an abbreviation we will allow

  • P(Weather) = ⟨0.6, 0.1, 0.29, 0.01⟩

where the bold P indicates that the result is a vector of numbers, and where we assume a predefined ordering ⟨sunny,rain,cloudy,snow⟩ on the domain of Weather

P statement defines a probability distribution for the random variable Weather

Conditional Distribution

given: P(X|Y) gives the values of P(X=xi|Y=yj) for each possible i,j pair

P statement defines a conditional distribution for the random variable X given Y

Probability Density Function

a random variable of infinite domain cannot be represented as a probability distribution

instead, we can define the probability that a random variable takes on some value x as a parameterized function of x

for example, the sentence:

P(NoonTemp=x) = Uniform[18C,26C](x) expresses the belief that the temperature at noon is distributed uniformly between 18 and 26 degrees Celsiusthe intuitive definition of P(x) is the probability that X falls within an arbitrarily small region beginning at x, divided by the width of the region:

P(x) = limdx→0P(x ≤ X ≤ x+dx)/dx

for NoonTemp we have:

P(NoonTemp=x) = Uniform[18,26](x) = if 18≤x≤26: 1/8 Centigrade; otherwise: 0;

in P(NoonTemp=20.18C) = 1/8 Centigrade, note 1/Centigrade that is not a probability, it is a probability density. The probability that NoonTemp is exactly 20.18C is zero, because 20.18C is a region of width 0

Joint Probability Distribution

joint probability distribution is a notation for distributions on multiple variablesP(X,Y) denotes the probabilities of all combinations of the values of X and Y

the product rules for all possible values of X and X can be written as a single equation:

P(X,Y) = P(X|Y)P(X)

Full Joint Probability Distribution

full joint probability distribution is a probability model that is completely determined by the joint distribution for all of the random variables

it can be used to find two other types of distributions: marginal distribution and conditional probability distribution

For example, if the variables are Cavity, Toothache, and Weather, then the full joint distribution is given by P(Cavity, Toothache, Weather). given that the variables: Cavity has 2 possible values, Toothache having 2, and Weather having 4, the joint distribution can be represented as a 2 × 2 × 4 table with 16 entries

example full joint probability distribution

Probability Axioms

AXIOM-1

first basic axiom of probability theory:

0 ≤ P(ω) ≤ 1 for every ω - every possible world has a probability between 0 and 1
ΣP(ω) = 1 - the total probability of the set of possible worlds is 1

AXIOM-2

for all propositional-events φ:

P(φ) = [ ΣP(ω) : ω∈φ ]

this is called marginalization

AXIOM-3

relationship between the probability of a proposition and the probability of its negation:

P(¬a) = Σω∈¬aP(ω)
P(¬a) = Σω∈¬aP(ω) + Σω∈aP(ω) − Σω∈aP(ω)
P(¬a) = Σω∈ΩP(ω) − Σω∈aP(ω)
P(¬a) = 1 - P(a)

AXIOM-4

inclusion–exclusion principle (Kolmogorov’s axioms):

P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

AXIOM-5

controversial axiom is that degrees of belief must be numbers, or at least act like numbers in that they must be:

  • transitive - if belief in A is greater than belief in B, which is greater than belief in C, then belief in A must be greater than C
  • comparable - the belief in A must be one of equal to, greater than, or less than belief in B

Inference Using Full Joint Distributions

probabilistic inference - the computation of posterior probabilities of query propositions given observed evidence

we use full joint distribution as the “knowledge base” from which answers to all questions may be derived & several useful techniques for manipulating equations involving probabilities

Example Full Joint Distribution

the figure above shows a 2×2×2 full-joint-distribution table, of a domain consisting of three Boolean variables: Toothache, Cavity, and Catch

  1. notice AI Chapter 13 - Quantifying Uncertainty holds: as the sum of all probabilities equals 1

  2. notice AI Chapter 13 - Quantifying Uncertainty gives us a direct way to calculate the probability of any event/proposition: simply identify those possible worlds in which the proposition is true and add up their probabilities.

    for example, there are six possible worlds in which the event/proposition ‘cavity toothache’ holds:
    P(cavity toothache) = P(cavity ʌ toothache ʌ catch) + P(cavity ʌ toothache ʌ ¬catch) + 
                                          P(cavity ʌ ¬toothache ʌ catch) +
     P(cavity ʌ ¬toothache ʌ ¬catch) + 
                                          P(¬cavity ʌ toothache ʌ catch) + P(¬cavity ʌ toothache ʌ ¬catch)P(cavity toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064P(cavity toothache) = 0.28this is an example of using marginalization rule as explained below

Marginalization & Conditioning
  • sum of all probabilities where a random variable has the value x (this process is called marginalization or summing out)
  • e.g. marginal probability of Cavity=
                                  P(cavity ʌ ¬toothache ʌ catch) + P(cavity ʌ ¬toothache ʌ ¬catch)
    P(Cavity=cavity) = 0.108 + 0.012 + 0.072 + 0.008 
    P(Cavity=cavity) = 0.2
    cavity:
    P(Cavity=cavity) = P(cavity ʌ toothache ʌcatch) + P(cavity ʌ toothache ʌ ¬catch) + 
  • marginalization rule for any sets of variables Y and ZP(Y) = Σz∈ZP(Y,z) where Σz∈Zmeans the sum over all possible combinations of values of the set of variables Z
  • conditioning rule - a variant of marginalization rule that involves conditional probabilities instead of joint probabilities, using the product rule
    P(Y) = Σz∈ZP(Y|z)P(z)
Calculating Conditional Probabilities

conditional probabilities can be found by first using AI Chapter 13 - Quantifying Uncertainty and conditional probability (expressing conditional probabilities as un-conditional probabilities)

for example, to compute the probability of a cavity, given evidence of a toothache:
P(cavity|toothache) = P(cavity∧toothache) / P(toothache)P(cavity|toothache) = (0.108 + 0.012)        / (0.108 + 0.012 + 0.016 + 0.064)
P(cavity|toothache) = 0.6

we can also compute the probability that there is no cavity, given a toothache:
P(¬cavity|toothache) = P(¬cavity∧toothache) / P(toothache)P(¬cavity|toothache) = (0.016 + 0.064)         / (0.108 + 0.012 + 0.016 + 0.064)P(¬cavity|toothache) = 0.4

as you can see P(cavity|toothache)P(¬cavity|toothache) = 1

Normalization Constant
  • notice that 1/P(Toothache=toothache) remains constant, for all values of the random variable Cavity. In fact, 1/P(Toothache=toothache) is the normalization constant for the conditional distribution P(Cavity | ToothAche=toothache)
  • normalization constant is denoted as α

with α the preceding equations can be represented as one:P(Cavity | toothache) = α P(Cavity, toothache)
P(Cavity | toothache) = α [P(Cavity,toothache,catch) + P(Cavity,toothache,¬catch)]
P(Cavity | toothache) = α [⟨0.108,0.016⟩+⟨0.012,0.064⟩]
P(Cavity | toothache) = α ⟨0.12,0.08⟩
P(Cavity | toothache) = ⟨0.6,0.4⟩

where: α = 1/P(Toothache=toothache)

General Inference Procedure
single variable general inference procedure

P(X|e) = α P(X,e) = α ΣyP(X,e,y)

where:

X - a single random variable
E - the list of evidence variables
e - the list of observed values
Y - the list of remaining unobserved variables
y - the list of remaining unobserved values

Given the full joint distribution to work with, the single variable general inference procedure can answer probabilistic queries for discrete variables

General Inference Procedure - Scaling Problem

It does not scale well, however: for a domain described by n Boolean variables, it requires:

  • an input table of size O(2n)
  • takes O(2n) time to process the table

The full joint distribution in tabular form is just not a practical tool for building reasoning systems. Instead, it should be viewed as the theoretical foundation on which more effective approaches may be built, just as truth tables formed a theoretical foundation for more practical algorithms like DPLL

Independence

let us expand the full joint distribution in Figure 13.3 by adding a fourth random variable, Weather. The full joint distribution then becomes P(Toothache, Catch, Cavity, Weather), which has 2×2×2×4=32 entries

how does P(toothache, catch, cavity, cloudy) and P(toothache, catch, cavity) related?

  • product ruleP(toothache, catch, cavity, cloudy) = P(cloudy | toothache, catch, cavity) P(toothache, catch, cavity)
  • it is safe to say that the weather does not influence the dental variables (this is called independence, marginal independence, and absolute independence)
  • independence: P(cloudy | toothache, catch, cavity) = P(cloudy)
  • product rule & independence: P(toothache, catch, cavity, cloudy) = P(cloudy) P(toothache, catch, cavity) 

In fact, we can write the general equation

P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)

Independence between propositions a and b can be written as

P(a|b)=P(a) or P(b|a)=P(b) or P(a∧b)=P(a)P(b)

Independence between variables X and Y can be written as

P(X|Y)=P(X) or P(Y|X)=P(Y) or P(X,Y)=P(X)P(Y)

If the complete set of variables can be divided into independent subsets, then the full joint distribution can be factored into separate joint distributions on those subsets.

For example, the full joint distribution on the outcome of n independent coin flips, P(C1, … , Cn), has 2nentries, but it can be represented as the product of n single-variable distributions P(Ci).

Bayes’ Rule and Its Use

the product rule can be written in 2 ways:

  • P(a ∧ b) = P(a|b) P(b)
  • P(a ∧ b) = P(b|a) P(a)

equating the two right-hand sides, we get (Bayes’ ruleBayes’ law, or Bayes’ theorem):

  • P(b|a) = P(a|b) P(b) / P(a)

general case of Bayes’ rule for multivalued variables written in P notation:

  • P(Y|X) = P(X|Y) P(Y) / P(X)

a more general version conditionalized on some background evidence e:

  • P(Y|X,e) = P(X|Y,eP(Y|e) / P(X|e)
diagram of bayes’ theorem

with: prior and posterior probabilities and normalization constant

Example Use

For example, a doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 70% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff neck is 1%. Letting s be the proposition that the patient has a stiff neck and m be the proposition that the patient has meningitis, we have:

  • P(s|m) = 0.7
  • P(m) = 1/50000
  • P(s) = 0.01
  • P(m|s) = P(s|m)P(m) / P(s) = 0.7×1/50000 = 0.0014

That is, we expect less than 1 in 700 patients with a stiff neck to have meningitis

Using Bayes’ Rule: Combining Evidence

What happens when we have two or more pieces of evidence?

for example, what can a dentist conclude if her nasty steel probe catches in the aching tooth of a patient?
P(Cavity | toothache ∧ catch)

2 approaches in computing P(Cavity | toothache ∧ catch):

  • using joint probabilities
  • using conditional probabilities

using joint probabilities: If we know the full joint distribution (Figure 13.3), we can read off the answer:
P(Cavity | toothache ∧ catch) = P(Cavity ∧ toothache ∧ catch) / P(toothache ∧ catch)
P(Cavity | toothache ∧ catch) = P(cavity ∧ toothache ∧ catch), P(¬cavity ∧ toothache ∧ catch)⟩ / P(toothache ∧ catch)P(Cavity | toothache ∧ catch) = ⟨0.108, 0.016⟩ αP(Cavity | toothache ∧ catch0.871, 0.129⟩

We know, however, that this approach does not scale up to larger numbers of variables (e.g. P(X | Y1, …, Yn))

using conditional probabilities: We can try using Bayes’ rule to reformulate the problem to use conditional probabilities:
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)

For this reformulation to work, we need to know the conditional probabilities of the conjunction toothache ∧ catch for each value of Cavity

That might be feasible for just two evidence variables, but again it does not scale up. If there are n possible evidence variables (X rays, diet, oral hygiene, etc.), then there are 2npossible combinations of observed values for which we would need to know conditional probabilities

Since both of these approaches are not expandable, we have 2 other options:

  1. approximate methods for evidence combination that, while giving incorrect answers, require fewer numbers to give any answer at all
  2. find some additional assertions about the domain that will enable us to simplify the expressions
    • conditional independence
Conditional Independence - Random Variables Becoming Independent Given Evidence

It would be nice if Toothache and Catch were independent, but they are not: if the probe catches in the tooth, then it is likely that the tooth has a cavity and that the cavity causes a toothache. These variables are independent, how- ever, given the presence or the absence of a cavity. Each is directly caused by the cavity, but neither has a direct effect on the other: toothache depends on the state of the nerves in the tooth, whereas the probe’s accuracy depends on the dentist’s skill, to which the toothache is irrelevant. Mathematically, this property is written as

P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity)

general definition of conditional independence of two variables X and Y , given a third variable Z, is
P(X,Y|Z) = P(X|ZP(Y|Z)

as with absolute independence the equivalent forms

  • P(X|Y,Z)=P(X|Z) - knowing Y when given Z does not change to probability of X
  • P(Y|X,Z)=P(Y|Z) - knowing X when given Z does not change to probability of Y

for n symptoms that are all conditionally independent given Cavity, the size of the representation grows as O(n) instead of O(2n). That means that conditional independence assertions can allow probabilistic systems to scale up; moreover, they are much more commonly available than absolute independence assertions

Naive Bayes Model or Bayesian classifier or Idiot Bayes Model

assumes the conditional independence of all effect variables, given a single cause variable, and grows linearly with the number of effects - the full joint distribution can be written as:

P(Cause, Effect1, …, Effectn) = P(Cause) ∏inP(Effecti| Cause)