What is ERM?

𝐿_𝑆(ℎ) is the training error (aka empirical error and empirical risk) defined as:

𝐿_𝑆(ℎ) = (1/𝑚) * |{(𝑥,𝑦)∊𝑆 : ℎ(𝑥) ≠ 𝑦}|

where:

𝑆 = {(𝑥₁,𝑦₁), (𝑥₂,𝑦₂), …, (𝑥_𝑚,𝑦_𝑚)} is the training set of size 𝑚
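
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the toy sample `S`, the threshold predictor `h`, and the `(float, int)` example type are all assumptions made for the sketch.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]  # (x, y) pair; these types are an assumption of this sketch


def empirical_risk(h: Callable[[float], int], sample: Sequence[Example]) -> float:
    """Fraction of examples in `sample` that `h` mislabels, i.e. L_S(h)."""
    if not sample:
        raise ValueError("the training set S must be non-empty")
    mistakes = sum(1 for x, y in sample if h(x) != y)
    return mistakes / len(sample)


# Hypothetical toy data: the label is 1 when x >= 0.5, else 0.
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
h = lambda x: 1 if x >= 0.3 else 0  # a threshold predictor chosen for illustration

print(empirical_risk(h, S))  # h mislabels only (0.4, 0), so this prints 0.25
```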

Since 𝑆 is a snapshot of the world, it makes sense to search for a predictor ℎ that minimizes 𝐿_𝑆(ℎ). This is called Empirical Risk Minimization (ERM).

ERM is formally defined as:

$\mathrm{ERM}_{H}(S) \in \operatorname*{argmin}_{h \in H} L_{S}(h)$

Let ℎ_𝑆 denote the result of applying 𝐸𝑅𝑀_𝐻 to 𝑆:

$h_{S} \in \operatorname*{argmin}_{h \in H} L_{S}(h)$
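
For a finite hypothesis class, ERM can be sketched as a plain search over the class. The example below is only an illustration under stated assumptions: the class of five threshold predictors, the toy sample, and the tie-breaking rule (first minimizer in list order) are hypothetical choices, not something the notes prescribe.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]
Example = Tuple[float, int]


def make_threshold(t: float) -> Hypothesis:
    """Hypothetical hypothesis: predict 1 when x >= t, else 0."""
    return lambda x: 1 if x >= t else 0


def empirical_risk(h: Hypothesis, sample: Sequence[Example]) -> float:
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def erm(hypotheses: Sequence[Hypothesis], sample: Sequence[Example]) -> Hypothesis:
    """Return one minimizer of L_S over a finite class (ties broken by list order)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))


H: List[Hypothesis] = [make_threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

h_S = erm(H, S)
print(empirical_risk(h_S, S))  # the threshold 0.5 fits S perfectly, so this prints 0.0
```

Because the argmin may contain several hypotheses, any rule that picks one of them is a valid ERM output; the list-order rule here is just the simplest to write.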

What Could Go Wrong?

TLDR: overfitting

𝐿_𝑆(ℎ) is the empirical risk
𝐿_(𝒟,𝑓)(ℎ) is the true risk

where:

𝒟 is the unknown true distribution from which the instances in 𝑆 are sampled
𝑓 is the unknown true labeling function (the true “hypothesis”)

Also, since we cannot guarantee perfect label prediction, we introduce an accuracy parameter, commonly denoted 𝜀.

We interpret:

𝐿_(𝒟,𝑓)(ℎ_𝑆) ≤ 𝜀 as an approximately correct predictor
𝐿_(𝒟,𝑓)(ℎ_𝑆) > 𝜀 as a failure of the learner
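
Overfitting can be illustrated with a predictor that memorizes the training set. The sketch below assumes a hypothetical setup that is not in the notes: 𝒟 is uniform on [0, 1), 𝑓 labels the right half with 1, and the true risk is only estimated on a fresh sample rather than computed exactly.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def f(x: float) -> int:
    """Hypothetical true labeling function: 1 on the right half of [0, 1)."""
    return 1 if x >= 0.5 else 0


# Training set S: m instances drawn i.i.d. from D = Uniform[0, 1), labeled by f.
m = 50
S = [(x, f(x)) for x in (rng.random() for _ in range(m))]
memory = dict(S)


def h_memorize(x: float) -> int:
    """Predict the stored label for seen points and 0 everywhere else."""
    return memory.get(x, 0)


def risk(h, sample) -> float:
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


fresh = [(x, f(x)) for x in (rng.random() for _ in range(20_000))]
empirical = risk(h_memorize, S)           # exactly 0: every training point is memorized
estimated_true = risk(h_memorize, fresh)  # close to 0.5: wrong on unseen x >= 0.5

epsilon = 0.1
print(empirical, estimated_true, estimated_true <= epsilon)
```

The memorizer has zero empirical risk yet fails the 𝜀 = 0.1 accuracy requirement, which is exactly the gap between 𝐿_𝑆 and 𝐿_(𝒟,𝑓) described above.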

Upper Bound of the Probability of Learner Failure

We would like to upper bound the probability of sampling an 𝑚-tuple of instances that leads to failure of the learner:

$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\})$

where:

𝒟^𝑚 denotes the probability over 𝑚-tuples induced by applying 𝒟 to pick each element of the tuple independently of the other members of the tuple
𝑆|_𝑥 = (𝑥₁, …, 𝑥_𝑚) is the instance part of the training set, i.e., an 𝑚-tuple of instances sampled i.i.d. from 𝒟

The upper bound is:

$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\}) \leq |H| e^{-\varepsilon m}$
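
The bound can be checked empirically with a small simulation. Everything concrete below is an assumption made for the sketch: a class of 11 threshold predictors, 𝒟 uniform on [0, 1), a true threshold of 0.5 that belongs to the class (so a zero-error hypothesis exists), and the values of 𝑚, 𝜀, and the number of trials.

```python
import math
import random

rng = random.Random(1)                     # fixed seed for reproducibility
THRESHOLDS = [k / 10 for k in range(11)]   # finite class H with |H| = 11
T_STAR = 0.5                               # true threshold, so f belongs to H


def predict(t: float, x: float) -> int:
    return 1 if x >= t else 0


def erm_threshold(sample) -> float:
    """Return the first threshold in H with the fewest training mistakes."""
    return min(THRESHOLDS, key=lambda t: sum(predict(t, x) != y for x, y in sample))


def true_risk(t: float) -> float:
    """Exact L_(D,f) under D = Uniform[0, 1): the mass between t and T_STAR."""
    return abs(t - T_STAR)


def failure_rate(m: int, epsilon: float, trials: int) -> float:
    """Fraction of sampled training sets whose ERM output has true risk > epsilon."""
    failures = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(m)]
        sample = [(x, predict(T_STAR, x)) for x in xs]
        if true_risk(erm_threshold(sample)) > epsilon:
            failures += 1
    return failures / trials


m, epsilon = 30, 0.15
bound = len(THRESHOLDS) * math.exp(-epsilon * m)
rate = failure_rate(m, epsilon, trials=2000)
print(f"observed failure rate: {rate:.4f}, bound |H|e^(-eps*m): {bound:.4f}")
```

In this setup the observed failure rate sits well below the bound, which is expected: the union bound over all bad hypotheses is usually loose.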

PROOF

Let the set of “bad” hypotheses be:

$H_{B} = \{h \in H : L_{(\mathcal{D}, f)}(h) > \varepsilon\}$

Let the set of misleading samples be:

$M = \{S|_{x} : \exists h \in H_{B},\ L_{S}(h) = 0\}$

Since the realizability assumption implies 𝐿_𝑆(ℎ_𝑆) = 0, the event 𝐿_(𝒟,𝑓)(ℎ_𝑆) > 𝜀 can only happen if for some ℎ∊𝐻_𝐵 we have 𝐿_𝑆(ℎ) = 0. In other words, this event can only happen if our sample is in the set of misleading samples 𝑀. Formally:

$\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\} \subseteq M$

Note that 𝑀 can be rewritten as:

$M = \bigcup_{h \in H_{B}} \{S|_{x} : L_{S}(h) = 0\}$

Hence:

$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\}) \leq \mathcal{D}^{m}(M)$

$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\}) \leq \mathcal{D}^{m}\left(\bigcup_{h \in H_{B}} \{S|_{x} : L_{S}(h) = 0\}\right)$

$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\}) \leq \sum_{h \in H_{B}} \mathcal{D}^{m}(\{S|_{x} : L_{S}(h) = 0\}) \;\;\; \text{by the union bound}$

$\mathcal{D}^{m}(\{S|_{x} : L_{S}(h) = 0\}) = \mathcal{D}^{m}(\{S|_{x} : \forall i,\ h(x_{i}) = f(x_{i})\}) \;\;\; \text{since } L_{S}(h) = 0 \text{ is equivalent to } \forall i,\ h(x_{i}) = f(x_{i})$

$\mathcal{D}^{m}(\{S|_{x} : L_{S}(h) = 0\}) = \prod_{i=1}^{m} \mathcal{D}(\{x_{i} : h(x_{i}) = f(x_{i})\}) \;\;\; \text{because the training instances are sampled i.i.d.}$

$\mathcal{D}(\{x_{i} : h(x_{i}) = f(x_{i})\}) = 1 - L_{(\mathcal{D}, f)}(h)$

$\mathcal{D}(\{x_{i} : h(x_{i}) = f(x_{i})\}) \leq 1 - \varepsilon \;\;\; \text{for every } h \in H_{B}$

$\mathcal{D}^{m}(\{S|_{x} : L_{S}(h) = 0\}) \leq \prod_{i=1}^{m} (1 - \varepsilon)$

$\mathcal{D}^{m}(\{S|_{x} : L_{S}(h) = 0\}) \leq (1 - \varepsilon)^{m}$

$\mathcal{D}^{m}(\{S|_{x} : L_{S}(h) = 0\}) \leq e^{-\varepsilon m} \;\;\; \text{via the inequality } 1 - \varepsilon \leq e^{-\varepsilon}$

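$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\}) \leq \sum_{h \in H_{B}} \mathcal{D}^{m}(\{S|_{x} : L_{S}(h) = 0\})$

$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\}) \leq \sum_{h \in H_{B}} e^{-\varepsilon m}$

$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\}) \leq |H_{B}| e^{-\varepsilon m}$

$\mathcal{D}^{m}(\{S|_{x} : L_{(\mathcal{D}, f)}(h_{S}) > \varepsilon\}) \leq |H| e^{-\varepsilon m}$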
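
The proof relies on the inequality 1 - 𝜀 ≤ 𝑒^(-𝜀). It holds because the difference 𝑒^(-𝜀) - (1 - 𝜀) is zero at 𝜀 = 0 and has derivative 1 - 𝑒^(-𝜀) ≥ 0 for 𝜀 ≥ 0. The snippet below is only a numerical sanity check on a grid, with a hypothetical sample size `m`; it is not a proof.

```python
import math


def gap(epsilon: float) -> float:
    """e^(-epsilon) - (1 - epsilon), which is non-negative for epsilon >= 0."""
    return math.exp(-epsilon) - (1.0 - epsilon)


grid = [k / 100 for k in range(101)]  # epsilon values in [0, 1]
m = 25                                # hypothetical sample size

holds_base = all(gap(eps) >= 0.0 for eps in grid)
holds_power = all((1.0 - eps) ** m <= math.exp(-eps * m) for eps in grid)

print(holds_base, holds_power)
```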

Note: the rest below is supplemental material from: https://www.baeldung.com/cs/probably-aproximately-correct

In other words, the probability that the hypothesis space contains a hypothesis whose training error is 0 but whose true error exceeds 𝜀 is at most |𝐻|𝑒^(-𝜀𝑚), where 𝑚 is the size of the training set. Mathematically:

𝐏(∃ℎ∊𝐻 s.t. 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔-𝑒𝑟𝑟𝑜𝑟(ℎ) = 0 AND 𝑡𝑟𝑢𝑒-𝑒𝑟𝑟𝑜𝑟(ℎ) > 𝜀) ≤ |𝐻|𝑒^(-𝜀𝑚)

The assumptions are that:

the hypothesis space 𝐻 is finite
training samples are i.i.d.

Where Does Probably Come From?

We can require that the bound |𝐻|𝑒^(-𝜀𝑚) on the failure probability be at most some small 𝛿:

$|H| e^{-\varepsilon m} \leq \delta$

From this, we can calculate the number of samples needed for the learned hypothesis to be approximately correct (true error at most 𝜀) with probability at least 1-𝛿:

$m \geq \frac{1}{\varepsilon} \left(\ln(|H|) + \ln\left(\frac{1}{\delta}\right)\right)$

So, as we increase the size of the data 𝑚, we can:

decrease the error rate 𝜀
decrease the failure probability 𝛿
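
Taking logarithms of |𝐻|𝑒^(-𝜀𝑚) ≤ 𝛿 gives ln|𝐻| - 𝜀𝑚 ≤ ln 𝛿, which rearranges to the sample-size formula above. A minimal sketch of that formula follows; the function name and the example values of |𝐻|, 𝜀, and 𝛿 are assumptions chosen for illustration.

```python
import math


def sample_complexity(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    if h_size < 1 or not 0.0 < epsilon < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("need |H| >= 1 and epsilon, delta in (0, 1)")
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)


print(sample_complexity(1000, 0.1, 0.05))   # 100
print(sample_complexity(1000, 0.05, 0.05))  # 199: halving epsilon roughly doubles m
print(sample_complexity(1000, 0.1, 0.01))   # 116: delta enters only logarithmically
```

The dependence on 𝜀 is linear in 1/𝜀, while |𝐻| and 𝛿 contribute only through logarithms, so tightening the confidence is much cheaper than tightening the accuracy.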

Agnostic PAC Learning

Agnostic PAC learning considers the case where the hypothesis space 𝐻 is inconsistent with the training data. In other words, the realizability assumption is lifted.

This means the error rate of the hypothesis set on the training data is non-zero. In this case, we have:

$P(\exists h \in H : \text{true-error}(h) > \text{training-error}(h) + \varepsilon) \leq |H| e^{-2 m \varepsilon^{2}}$

From the above inequality, we can find the sample complexity in agnostic PAC learning to be:

$m \geq \frac{1}{2 \varepsilon^{2}} \left(\ln(|H|) + \ln\left(\frac{1}{\delta}\right)\right)$
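
The agnostic formula can be evaluated the same way. The sketch below is illustrative only: the function name and the example values are assumptions, and the point is to show how the 1/𝜀² dependence changes the numbers.

```python
import math


def agnostic_sample_complexity(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (1 / (2 * epsilon^2)) * (ln|H| + ln(1/delta))."""
    if h_size < 1 or not 0.0 < epsilon < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("need |H| >= 1 and epsilon, delta in (0, 1)")
    numerator = math.log(h_size) + math.log(1.0 / delta)
    return math.ceil(numerator / (2.0 * epsilon ** 2))


print(agnostic_sample_complexity(1000, 0.1, 0.05))   # 496
print(agnostic_sample_complexity(1000, 0.05, 0.05))  # 1981: halving epsilon quadruples m
```

For the same |𝐻| = 1000, 𝜀 = 0.1, and 𝛿 = 0.05, the realizable-case formula asks for about 100 samples, so dropping realizability costs roughly a factor of 1/(2𝜀) here.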

PAC Learnability and VC Dimension

As we saw above, a concept class 𝐻 is PAC-learnable if the sample complexity 𝑚 is a polynomial function of 1/𝜀, 1/𝛿, and |𝐻|, the size of the concept class.

The VC dimension is the maximum number of points that a hypothesis class can shatter (i.e., for every possible labeling of those points, some hypothesis in the class separates the differently labeled points).

PAC learnability and VC dimension are closely related:

𝐻 is agnostically PAC-learnable iff 𝐻 has a finite VC dimension

If VC dimension of 𝐻 is finite, then the sample complexity 𝑚 can be computed as follows:

$m > \frac{1}{\varepsilon} \left(8\, \mathrm{VC}(H) \log\left(\frac{13}{\varepsilon}\right) + 4 \log\left(\frac{2}{\delta}\right)\right)$

where:

𝜀 is the learner’s maximum error, guaranteed with probability at least 1-𝛿
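
The VC-based bound can be evaluated numerically as well. One caveat: the formula above writes log without a base, so the sketch below exposes the base as a parameter and defaults to 2, which is an assumption of this example rather than something stated in the notes. The function name and sample values are also hypothetical.

```python
import math


def vc_sample_complexity(vc_dim: int, epsilon: float, delta: float,
                         log_base: float = 2.0) -> int:
    """Smallest integer m with m > (1/eps) * (8*VC*log(13/eps) + 4*log(2/delta)).

    The logarithm base is an assumption of this sketch (default: base 2).
    """
    if vc_dim < 0 or not 0.0 < epsilon < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("need VC >= 0 and epsilon, delta in (0, 1)")
    bound = (8 * vc_dim * math.log(13.0 / epsilon, log_base)
             + 4 * math.log(2.0 / delta, log_base)) / epsilon
    return math.floor(bound) + 1  # strict inequality: first integer above the bound


print(vc_sample_complexity(vc_dim=4, epsilon=0.1, delta=0.05))  # 2461 with base-2 logs
```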

Examples

Class of 2D Rectangles

The set of axis-aligned rectangles in a 2D space is PAC-learnable.

To show this, it’s sufficient to find the sample complexity of this hypothesis set. And to do that, we can find its VC dimension.

From the figures below, we can see that, for suitably placed points, a rectangle can separate 2, 3, and 4 data points with ANY labeling.

No matter how these points are labeled, we can always place a rectangle that separates differently labeled points.

However, when there are five points, shattering them with a rectangle is impossible. As a result, the VC dimension of axis-aligned rectangles is 4.

Using this, we can calculate the sample complexity for arbitrary 𝜀 and 𝛿.

So, the class of 2D rectangles is PAC-learnable.
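
Shattering by axis-aligned rectangles can be checked by brute force. The sketch below rests on one observation: a labeling is realizable exactly when the tightest bounding box of the positive points contains no negative point. The point sets are hypothetical examples, and the five-point check covers only one specific configuration. The general argument is that among any five points, labeling the leftmost, rightmost, topmost, and bottommost points positive and a remaining point negative cannot be realized.

```python
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def realizable(points: Sequence[Point], labels: Sequence[int]) -> bool:
    """True if some closed axis-aligned rectangle contains exactly the positives."""
    positives = [p for p, lab in zip(points, labels) if lab == 1]
    if not positives:
        return True  # an empty rectangle labels every point negative
    lo_x = min(p[0] for p in positives)
    hi_x = max(p[0] for p in positives)
    lo_y = min(p[1] for p in positives)
    hi_y = max(p[1] for p in positives)
    # The bounding box of the positives is the smallest candidate rectangle;
    # if it already swallows a negative point, every larger rectangle does too.
    return not any(
        lo_x <= x <= hi_x and lo_y <= y <= hi_y
        for (x, y), lab in zip(points, labels)
        if lab == 0
    )


def shattered(points: Sequence[Point]) -> bool:
    """True if every one of the 2^n labelings of `points` is realizable."""
    return all(realizable(points, labels)
               for labels in product([0, 1], repeat=len(points)))


diamond: List[Point] = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(shattered(diamond))             # True: this 4-point set is shattered
print(shattered(diamond + [(0, 0)]))  # False: the center cannot be excluded
```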

Class of Polynomial Classifiers in ℝ

A linear (threshold) classifier on a one-dimensional line can shatter at most 2 points, and a line in two-dimensional space can shatter at most 3 points. Similarly, the VC dimension of the class of polynomial classifiers of degree 𝑛 is 𝑛+1. As a result, each class of polynomial classifiers of a fixed finite degree is PAC-learnable.

However, the class of all polynomial classifiers (i.e., their union) has a VC dimension of ∞. Therefore, the union of polynomial classifiers is not PAC-learnable.

So, although any set of polynomials with the same finite degree is learnable, their union isn’t.
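
The lower bound behind the 𝑛+1 claim can be sketched with Lagrange interpolation: for any ±1 labeling of 𝑛+1 distinct points on the line, the degree-𝑛 interpolating polynomial takes exactly those values, so its sign reproduces the labeling. The code below only demonstrates this direction on hypothetical points. The matching upper bound comes from the fact that a degree-𝑛 polynomial has at most 𝑛 real roots, so it cannot alternate sign across 𝑛+2 points.

```python
from itertools import product
from typing import Sequence


def lagrange_eval(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Evaluate at x the unique polynomial of degree < len(xs) through (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = float(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def shatters(xs: Sequence[float]) -> bool:
    """Check that sign(polynomial) realizes every +1/-1 labeling of the points xs."""
    for labels in product([-1, 1], repeat=len(xs)):
        predictions = [1 if lagrange_eval(xs, labels, x) >= 0 else -1 for x in xs]
        if predictions != list(labels):
            return False
    return True


points = [0.0, 1.0, 2.0, 3.0]  # n + 1 = 4 distinct points, so degree n = 3 suffices
print(shatters(points))        # True
```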

／var／log marcus chiu

Chapter 2 - Empirical Risk Minimization (ERM)
