- extrinsic evaluation
- intrinsic evaluation
- perplexity - the inverse probability of the test set, normalized by the number of words
Extrinsic (in-vivo) Evaluation
extrinsic evaluation is the best way to compare two language models 𝐴 and 𝐵:
- put each model in a task (spelling corrector, speech recognizer, etc)
- get the accuracy of 𝐴 and 𝐵
  - how many misspelled words were corrected properly
  - how many words were translated correctly
- compare the accuracy of 𝐴 and 𝐵
downside: running each model inside a full task is time-consuming
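
The procedure above can be sketched in a few lines. This is only an illustration: the two "models" are hypothetical lookup-table spelling correctors, and `task_accuracy`, `corrector_a`, `corrector_b`, and the test pairs are made-up names and data, not part of the notes.

```python
from typing import Callable, Iterable, Tuple


def task_accuracy(correct: Callable[[str], str],
                  pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of misspelled words that the corrector fixes properly."""
    pairs = list(pairs)
    hits = sum(1 for misspelled, gold in pairs if correct(misspelled) == gold)
    return hits / len(pairs)


# Hypothetical test set of (misspelled word, intended word) pairs.
test_pairs = [("teh", "the"), ("recieve", "receive"),
              ("adress", "address"), ("wich", "which")]

# Toy stand-ins for two language models plugged into a spelling corrector.
table_a = {"teh": "the", "recieve": "receive", "adress": "address"}
table_b = {"teh": "the", "wich": "witch"}


def corrector_a(word: str) -> str:
    return table_a.get(word, word)  # leave unknown words unchanged


def corrector_b(word: str) -> str:
    return table_b.get(word, word)


acc_a = task_accuracy(corrector_a, test_pairs)  # 3 of 4 correct
acc_b = task_accuracy(corrector_b, test_pairs)  # 1 of 4 correct
print(f"A: {acc_a:.2f}  B: {acc_b:.2f}")
```

In a real comparison each corrector would wrap an actual language model, which is where the time cost comes from.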
Intrinsic Evaluation - Perplexity
- $PP(W) = P(w_1, \dots, w_n)^{-1/n}$
- $PP(W) = \left[\prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})\right]^{-1/n}$ # chain rule
- $PP(W) = \left[\prod_{i=1}^{n} P(w_i \mid w_{i-1})\right]^{-1/n}$ # bi-grams
minimizing perplexity ≈ maximizing probability
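
A minimal sketch of the bi-gram formula, computed in log space to avoid numeric underflow on long texts. The function name `bigram_perplexity`, the start token `<s>` used as context for the first word, and the toy probabilities are assumptions made for illustration. There is no smoothing, so an unseen bi-gram raises a `KeyError`.

```python
import math


def bigram_perplexity(words, bigram_prob, start="<s>"):
    """PP(W) = [prod_i P(w_i | w_{i-1})]^(-1/n), summed as log-probabilities."""
    log_prob = 0.0
    prev = start  # assumed sentence-start token used as context for w_1
    for w in words:
        log_prob += math.log(bigram_prob[(prev, w)])
        prev = w
    return math.exp(-log_prob / len(words))


# Hypothetical bi-gram probabilities P(w_i | w_{i-1}).
probs = {("<s>", "i"): 0.5, ("i", "like"): 0.25, ("like", "tea"): 0.5}

pp = bigram_perplexity(["i", "like", "tea"], probs)
print(pp)  # the sentence probability is 1/16, so PP = 16 ** (1/3)
```

A model that assigns this sentence a higher probability would get a lower perplexity, which is the equivalence stated above.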
perplexity:
- is a bad approximation of extrinsic performance unless the test data looks just like the training data
- is related to the weighted average branching factor (how many words can follow, weighted by their probability)
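
To illustrate the branching-factor reading, here is a small sketch under assumed toy distributions. The helper `perplexity` and both digit sequences are hypothetical. When ten digits are equally likely, the perplexity equals the branching factor of 10. When one digit is far more likely than the rest, the plain branching factor is still 10 but the weighted one, and so the perplexity, is much lower.

```python
import math


def perplexity(word_probs):
    """Perplexity from the per-word probabilities a model assigned to a text."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)


# Uniform case: each of 10 digits has probability 1/10 at every position.
uniform_pp = perplexity([1 / 10] * 20)

# Skewed case (assumed): digit 0 has probability 0.91, each other digit 0.01.
# The text contains nine 0s and one other digit.
skewed_pp = perplexity([0.91] * 9 + [0.01])

print(uniform_pp)  # approximately 10
print(skewed_pp)   # well below 10, roughly 1.7
```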