What is ERM?

𝐿𝑆(β„Ž) is the training error (aka empirical error and empirical risk) defined as:

  • 𝐿𝑆(β„Ž) = (1/π‘š) * |{(π‘₯,𝑦)βˆŠπ‘† : β„Ž(π‘₯) β‰  𝑦}|

where:

  • 𝑆 = {(π‘₯1,𝑦1), (π‘₯2,𝑦2), …, (π‘₯π‘š,π‘¦π‘š)} is the training set of size π‘š

Since 𝑆 is a snapshot of the world, it makes sense to search for a predictor β„Ž that minimizes 𝐿𝑆(β„Ž). This is called Empirical Risk Minimization (ERM).

ERM is formally defined as:

Let β„Žπ‘† denote the result of applying 𝐸𝑅𝑀𝐻 to 𝑆:

What Could Go Wrong?

TLDR: overfitting

  • 𝐿𝑆(β„Ž) is the empirical risk
  • 𝐿(π’Ÿ,𝑓)(β„Ž) is the true risk

where:

  • π’Ÿ is the unknown true distribution of 𝑆
  • 𝑓 is the unknown true β€œhypothesis”

Also since we cannot guarantee perfect label prediction, we introduce the accuracy parameter commonly denoted as πœ€.

We interpret:

  • 𝐿(π’Ÿ,𝑓)(β„Žπ‘†) ≀ πœ€ as an approximately correct predictor
  • 𝐿(π’Ÿ,𝑓)(β„Žπ‘†) > πœ€ as a failure of the learner

Upper Bound of the Probability of Leaner Failure

We would like to upper bound the probability to sample π‘š-tuple of instances that will lead to failure of the learner:

where:

  • π’Ÿπ‘š denotes the probability over π‘š-tuples induced by applying π’Ÿ to pick each element of the tuple independently of the other members of the tuple
  • 𝑆|π‘₯ = (π‘₯1, …, π‘₯π‘š) is a training set, i.e. an π‘š-tuple of instances i.i.d. from π’Ÿ

The upper bound is defined as:

In other words, it says that the probability that the hypothesis space contains a hypothesis where its training-error is 0 and its true-error is >πœ€, is lower than |𝐻|𝑒-πœ€π‘š. Where π‘š is the size of the training set. Mathematically:

  • 𝐏(βˆƒβ„ŽβˆŠπ» s.t. π‘‘π‘Ÿπ‘Žπ‘–π‘›π‘–π‘›π‘”-π‘’π‘Ÿπ‘Ÿπ‘œπ‘Ÿ(β„Ž) = 0 AND π‘‘π‘Ÿπ‘’π‘’-π‘’π‘Ÿπ‘Ÿπ‘œπ‘Ÿ(β„Ž) > πœ€) ≀ |𝐻|𝑒-πœ€π‘š

The assumptions are that:

  • the hypothesis space 𝐻 is finite
  • training samples are i.i.d.

Where Does Probably Come From?

We can bound the probability |𝐻|𝑒-πœ€π‘šΒ from above:

From this, we can calculate the number of samples we need for a set of hypothesis 𝐻 to be approximately correct with the predefined probability 𝛿:

So as we increase the size of the data π‘š we could:

  • decrease the error rate πœ€
  • decrease the probability 𝛿

Agnostic PAC Learning

Agnostic PAC learning considers the case where the hypothesis space 𝐻 is inconsistent with the training data. In other words, the realizability assumption is lifted.

This means the error rate of the hypothesis set on the training data is non-zero. In this case, we have:

From the above inequality, we can find the sample complexity in agnostic PAC learning to be:

PAC Learnability and VC Dimension

As we saw above, PAC learnability for a concept class 𝐻 holds if the sample complexity π‘š is a polynomial function of (1/πœ€), (1/𝛿), and |𝐻| the size of the concept class.

VC dimension is the maximum number of points a hypothesis can shatter (i.e. separate differently labeled points for any labeling).

PAC learnability and VC dimension are closely related:

  • 𝐻 is agnostically PAC-learnable iff 𝐻 has a finite VC dimension

If VC dimension of 𝐻 is finite, then the sample complexity π‘š can be computed as follows:

where:

  • πœ€ is the learner’s maximum error with the 1-𝛿 probability

Examples

Resources