What is ERM?
πΏπ(β) is the training error (aka empirical error and empirical risk) defined as:
- πΏπ(β) = (1/π) * |{(π₯,π¦)βπ : β(π₯) β π¦}|
where:
- π = {(π₯1,π¦1), (π₯2,π¦2), β¦, (π₯π,π¦π)} is the training set of size π
Since π is a snapshot of the world, it makes sense to search for a predictor β that minimizesΒ πΏπ(β). This is called Empirical Risk Minimization (ERM).
ERM is formally defined as:
Let βπ denote the result of applying πΈπ ππ» to π:
What Could Go Wrong?
TLDR: overfitting
- πΏπ(β) is the empirical risk
- πΏ(π,π)(β) is the true risk
where:
- π is the unknown true distribution of π
- π is the unknown true βhypothesisβ
Also since we cannot guarantee perfect label prediction, we introduce the accuracy parameter commonly denoted as π.
We interpret:
- πΏ(π,π)(βπ) β€ π as an approximately correct predictor
- πΏ(π,π)(βπ) > π as a failure of the learner
Upper Bound of the Probability of Leaner Failure
We would like to upper bound the probability to sample π-tuple of instances that will lead to failure of the learner:
where:
- ππ denotes the probability over π-tuples induced by applying π to pick each element of the tuple independently of the other members of the tuple
- π|π₯ = (π₯1, β¦, π₯π) is a training set, i.e. an π-tuple of instances i.i.d. from π
The upper bound is defined as:
PROOF
Let the set of βbadβ hypothesis be:
Let the set of misleading examples be:
Since, πΏπ(βπ) = 0, the event πΏ(π,π)(βπ) > π can only happen if for some ββπ»π΅ we have πΏπ(β) = 0. In other words, this event will only happen if our sample is in the set of misleading samples π. Formally it is defined as:
Which can be rewritten as:
Hence:
- \mathcal{D}^m(\{ S|_x : L_S(h) = 0 \}) = \mathcal{D}^m(\{ S|_x : \forall_i, h(x_i) = f(x_i) \}) \;\;\; \text{ The event πΏ_π(β) = 0 is equivalent to the event βπ, β(π₯_π) = π(π₯_π)}
[!info]the rest below is supplemental from: https://www.baeldung.com/cs/probably-aproximately-correct
In other words, it says that the probability that the hypothesis space contains a hypothesis where its training-error is 0 and its true-error is >π, is lower than |π»|π-ππ. Where π is the size of the training set. Mathematically:
- π(βββπ» s.t. π‘πππππππ-πππππ(β) = 0 AND π‘ππ’π-πππππ(β) > π) β€ |π»|π-ππ
The assumptions are that:
- the hypothesis space π» is finite
- training samples are i.i.d.
Where Does Probably Come From?
We can bound the probability |π»|π-ππΒ from above:
From this, we can calculate the number of samples we need for a set of hypothesis π» to be approximately correct with the predefined probability πΏ:
So as we increase the size of the data π we could:
- decrease the error rate π
- decrease the probability πΏ
Agnostic PAC Learning
Agnostic PAC learning considers the case where the hypothesis space π» is inconsistent with the training data. In other words, the realizability assumption is lifted.
This means the error rate of the hypothesis set on the training data is non-zero. In this case, we have:
From the above inequality, we can find the sample complexity in agnostic PAC learning to be:
PAC Learnability and VC Dimension
As we saw above, PAC learnability for a concept class π» holds if the sample complexity π is a polynomial function of (1/π), (1/πΏ), and |π»| the size of the concept class.
VC dimension is the maximum number of points a hypothesis can shatter (i.e. separate differently labeled points for any labeling).
PAC learnability and VC dimension are closely related:
- π» is agnostically PAC-learnable iff π» has a finite VC dimension
If VC dimension of π» is finite, then the sample complexity π can be computed as follows:
where:
- π is the learnerβs maximum error with the 1-πΏ probability
Examples
Class of 2D Rectangles
The set of axis-aligned rectangles in a 2D space is PAC-learnable.
To show this, itβs sufficient to find the sample complexity of this hypothesis set. And to do that, we can find its VC dimension.
From the figures below, we can see that a rectangle can separate 2, 3, and 4 data points with ANY labeling
No matter how these points are labeled, we can always place a rectangle that separates differently labeled points.
However, when there are five points, shattering them with a rectangle is impossible. As a result, the VC dimension of axis-aligned rectangles is 4.
Using this, we can calculate the sample complexity with arbitrary \epsilon and \delta.
So, the class of 2D rectangles is PAC-learnable.
Class of Polynomial Classifiers in β
A classifier in a one-dimensional line can shatter at most 2 points, and a line in two-dimensional space can shatter at most 3 points. Similarly, the VC dimension of a polynomial classifier of degree π is π+1. As a result, each finite polynomial is PAC-learnable.
However, the class of all polynomial classifiers (i.e., their union) has a VC dimension of β. Therefore, the union of polynomial classifiers is not PAC-learnable.
So, although any set of polynomials with the same finite degree is learnable, their union isnβt.
---cognitive-computing---machine-intelligence/ai---subfields/computational-learning-theory-(colt)/colt---textbooks/understanding-machine-learningοΌ-from-theory-to-algorithms/chapter-2---empirical-risk-minimization-(erm)/vc-dimension-rectangular-boxes-2.png)
---cognitive-computing---machine-intelligence/ai---subfields/computational-learning-theory-(colt)/colt---textbooks/understanding-machine-learningοΌ-from-theory-to-algorithms/chapter-2---empirical-risk-minimization-(erm)/vc-dimension-rectangular-boxes-3.png)
---cognitive-computing---machine-intelligence/ai---subfields/computational-learning-theory-(colt)/colt---textbooks/understanding-machine-learningοΌ-from-theory-to-algorithms/chapter-2---empirical-risk-minimization-(erm)/vc-dimension-rectangular-boxes-4.png)
---cognitive-computing---machine-intelligence/ai---subfields/computational-learning-theory-(colt)/colt---textbooks/understanding-machine-learningοΌ-from-theory-to-algorithms/chapter-2---empirical-risk-minimization-(erm)/vc-dimension-rectangular-boxes-5.png)