Many problems in language processing can be viewed as noisy channel problems:

Optical Character Recognition
Spelling Correction
Speech recognition
Machine translation

The Noisy Channel Model of Spelling

𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑 → noisy-channel → 𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑

find the most probable 𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑 given the observed 𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑:

𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑ˆ = 𝑎𝑟𝑔𝑚𝑎𝑥_{𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑} 𝐏(𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑|𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑)
𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑ˆ = 𝑎𝑟𝑔𝑚𝑎𝑥_{𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑} 𝐏(𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑|𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑)𝐏(𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑)/𝐏(𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑) # Bayes’ Theorem
𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑ˆ = 𝑎𝑟𝑔𝑚𝑎𝑥_{𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑} 𝐏(𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑|𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑)𝐏(𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑) # 𝐏(𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑) is a constant w.r.t. 𝑎𝑟𝑔𝑚𝑎𝑥_{𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑}

where:

𝐏(𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑) - language model (prior probability)
𝐏(𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑|𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑) - noisy channel model or error model (likelihood)
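
The last step of the derivation drops 𝐏(𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑) because it is the same for every candidate. A small numeric check of that claim is sketched below. The candidate names `s1`..`s3` and their joint values are made up for illustration, and the sketch pretends these three are the only possible source words.

```python
def argmax(scores):
    """Return the key with the largest value."""
    return max(scores, key=scores.get)

# Hypothetical values of P(noisy|source) * P(source) for three candidates.
joint = {"s1": 2e-9, "s2": 5e-9, "s3": 1e-9}

# P(noisy) is the sum of the joint over all sources
# (assumption: these three candidates are the whole source vocabulary).
p_noisy = sum(joint.values())

# Full Bayes posterior P(source|noisy) = joint / P(noisy).
posterior = {s: v / p_noisy for s, v in joint.items()}

# Dividing every score by the same positive constant keeps the winner.
print(argmax(joint), argmax(posterior))
```

Both calls pick the same candidate, which is why only the language model and the channel model need to be estimated.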

Language Model Probability

unigram, bigram, trigram, n-gram
web-scale spelling correction
- stupid backoff

Noisy Channel Model - Problems

It is fruitless to try to collect statistics about the misspellings of individual words given a large dictionary. You’ll likely never get enough data.

We need a way to estimate 𝐏(𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑|𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑) without relying on direct counts of how each individual word gets misspelled.

This is where edit distance comes in.

Edit Distance

The Damerau-Levenshtein edit distance between 2 strings is the minimum number of edits needed to turn one into the other, where the edits are:

deletion
- there → ther
insertion (also allow insertion of space or hyphen)
- the → ther
substitution
- now → noq
transposition of 2 adjacent characters
- the → teh
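
A minimal sketch of this distance follows. It implements the restricted variant usually called optimal string alignment, in which no substring is edited more than once. That is an assumption on my part rather than something the notes specify, but it agrees with the full Damerau-Levenshtein distance on all the examples above. The function name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Minimum number of deletions, insertions, substitutions and
    adjacent transpositions turning `a` into `b` (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i characters of a
    for j in range(cols):
        d[0][j] = j          # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

for source, noisy in [("there", "ther"), ("the", "ther"), ("now", "noq"), ("the", "teh")]:
    print(source, noisy, osa_distance(source, noisy))
```

Each of the four example pairs comes out at distance 1, one per edit type.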

Candidate Generation

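About 80% of spelling errors are within edit distance 1 of the intended word.
Almost all errors are within edit distance 2.
So candidate corrections can be limited to dictionary words within 1 or 2 edits of the noisy word.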
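
One common way to generate the distance-1 candidates is to apply every possible single edit to the noisy word and keep the results that are real words. The sketch below does that. The names `edits1` and `candidates`, the lowercase ASCII alphabet, and the tiny `vocab` set are assumptions for illustration, and insertion of a space or hyphen is not handled.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, substitution or insertion away.
    (Substituting a letter by itself also yields `word` unchanged.)"""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def candidates(noisy, vocabulary):
    """Dictionary words within one edit of the noisy word."""
    return edits1(noisy) & vocabulary

# Toy vocabulary; a real system would use a full dictionary.
vocab = {"actress", "cress", "caress", "access", "across", "acres", "zebra"}
print(sorted(candidates("acress", vocab)))
```

Applying `edits1` to each result again would give the distance-2 candidates, at the cost of a much larger set.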

Learning

Collect statistics for each error type from a large corpus.

For example, 𝐏(acress|actress) is assumed to equal the probability of the deletion that produced it, 𝐏(c|ct): ct was intended but only c was typed.

So collect a large corpus of text containing errors and count how often t gets deleted after c.

Channel/Error Model Probability

𝐏(𝑥|𝑦) = probability of edit

where:

𝑑𝑒𝑙[𝑥,𝑦]: count of 𝑥 in noisy training set when it should be 𝑥𝑦
𝑖𝑛𝑠[𝑥,𝑦]: count of 𝑥𝑦 in noisy training set when it should be 𝑥
𝑠𝑢𝑏[𝑥,𝑦]: count of 𝑥 in noisy training set when it should be 𝑦
𝑡𝑟𝑎[𝑥,𝑦]: count of 𝑦𝑥 in noisy training set when it should be 𝑥𝑦

single edit:

𝐏(𝑥|𝑥𝑦) = 𝑑𝑒𝑙[𝑥,𝑦] / 𝑐𝑜𝑢𝑛𝑡-𝑠𝑜𝑢𝑟𝑐𝑒[𝑥,𝑦] # if deletion
𝐏(𝑥𝑦|𝑥) = 𝑖𝑛𝑠[𝑥,𝑦] / 𝑐𝑜𝑢𝑛𝑡-𝑠𝑜𝑢𝑟𝑐𝑒[𝑥] # if insertion
𝐏(𝑥|𝑦) = 𝑠𝑢𝑏[𝑥,𝑦] / 𝑐𝑜𝑢𝑛𝑡-𝑠𝑜𝑢𝑟𝑐𝑒[𝑦] # if substitution
𝐏(𝑦𝑥|𝑥𝑦) = 𝑡𝑟𝑎[𝑥,𝑦] / 𝑐𝑜𝑢𝑛𝑡-𝑠𝑜𝑢𝑟𝑐𝑒[𝑥,𝑦] # if transposition
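
The four ratios can be sketched directly from count tables. All counts below are hypothetical: the notes do not give the underlying corpus counts, so the numbers are chosen only so that two results match the 0.000117 (c|ct) and 0.0000093 (e|o) values in the acress table. The table and function names are also assumptions.

```python
from collections import Counter

# Hypothetical error counts, indexed as in the definitions above.
del_counts = Counter({("c", "t"): 117})  # "c" typed when "ct" was intended
ins_counts = Counter({("e", "s"): 30})   # "es" typed when "e" was intended
sub_counts = Counter({("e", "o"): 93})   # "e" typed when "o" was intended
tra_counts = Counter({("c", "a"): 5})    # "ac" typed when "ca" was intended

# Hypothetical counts of characters and character pairs in the correct text.
source_counts = Counter({"ct": 1_000_000, "e": 2_000_000,
                         "o": 10_000_000, "ca": 500_000})

def p_deletion(x, y):
    return del_counts[(x, y)] / source_counts[x + y]

def p_insertion(x, y):
    return ins_counts[(x, y)] / source_counts[x]

def p_substitution(x, y):
    return sub_counts[(x, y)] / source_counts[y]

def p_transposition(x, y):
    return tra_counts[(x, y)] / source_counts[x + y]

print(p_deletion("c", "t"))      # P(c|ct)
print(p_substitution("e", "o"))  # P(e|o)
```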

let:

𝑠 be correct 𝑠𝑜𝑢𝑟𝑐𝑒-𝑤𝑜𝑟𝑑 where 𝑠 = [𝑠₁, …, 𝑠_𝑗]
𝑛 be misspelled 𝑛𝑜𝑖𝑠𝑦-𝑤𝑜𝑟𝑑 where 𝑛 = [𝑛₁, …, 𝑛_𝑘]

for single edit:

𝐏(𝑛|𝑠) = 𝑑𝑒𝑙[𝑠_{𝑖-1},𝑠_𝑖] / 𝑐𝑜𝑢𝑛𝑡[𝑠_{𝑖-1},𝑠_𝑖] # if deletion
𝐏(𝑛|𝑠) = 𝑖𝑛𝑠[𝑠_{𝑖-1},𝑛_𝑖] / 𝑐𝑜𝑢𝑛𝑡[𝑠_{𝑖-1}] # if insertion
𝐏(𝑛|𝑠) = 𝑠𝑢𝑏[𝑛_𝑖,𝑠_𝑖] / 𝑐𝑜𝑢𝑛𝑡[𝑠_𝑖] # if substitution
𝐏(𝑛|𝑠) = 𝑡𝑟𝑎[𝑠_𝑖,𝑠_{𝑖+1}] / 𝑐𝑜𝑢𝑛𝑡[𝑠_𝑖,𝑠_{𝑖+1}] # if transposition

Noisy Channel Probability Model For “acress”

𝑛 = acress

| Candidate correction 𝑠 | Correct letter | Error letter | 𝑛\|𝑠 | 𝐏(𝑛\|𝑠) | 𝐏(𝑠) | 𝐏(𝑛\|𝑠)𝐏(𝑠) × 10⁹ |
|---|---|---|---|---|---|---|
| `actress` | `t` | `-` | `c\|ct` | 0.000117 | 0.0000231 | 2.7027 |
| `cress` | `-` | `a` | `a\|^` | 0.00000144 | 0.000000544 | 0.00078336 |
| `caress` | `ca` | `ac` | `ac\|ca` | 0.00000164 | 0.00000170 | 0.002788 |
| `access` | `c` | `r` | `r\|c` | 0.00000209 | 0.0000916 | 0.191444 |
| `across` | `o` | `e` | `e\|o` | 0.0000093 | 0.000299 | 2.7807 |
| `acres` | `-` | `s` | `es\|e` | 0.0000321 | 0.0000318 | 1.02078 |
| `acres` | `-` | `s` | `ss\|s` | 0.0000342 | 0.0000318 | 1.08756 |

Thus, with these probabilities, acress would be corrected to across: it has the highest score (2.7807), narrowly ahead of actress (2.7027).
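
The last column of the table can be recomputed and ranked in a few lines. This is only a sketch: the probabilities are copied from the table above, while the tuple layout and variable names are my own.

```python
# (candidate, edit n|s, P(n|s), P(s)) copied from the table above.
rows = [
    ("actress", "c|ct", 0.000117, 0.0000231),
    ("cress", "a|^", 0.00000144, 0.000000544),
    ("caress", "ac|ca", 0.00000164, 0.00000170),
    ("access", "r|c", 0.00000209, 0.0000916),
    ("across", "e|o", 0.0000093, 0.000299),
    ("acres", "es|e", 0.0000321, 0.0000318),
    ("acres", "ss|s", 0.0000342, 0.0000318),
]

# Score each candidate by P(n|s) * P(s), scaled by 10^9 as in the table.
ranked = sorted(
    ((channel * prior * 1e9, word, edit) for word, edit, channel, prior in rows),
    reverse=True,
)

for score, word, edit in ranked:
    print(f"{word:8s} {edit:6s} {score:.5f}")

best = ranked[0][1]
```

The top two candidates are separated by less than 3%, which is why the choice of language model matters so much here.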

Using a Bigram Language Model

A unigram language model ignores the surrounding words, so it does worse than a bigram model.

for example:

“… a versatile acress whose combination …”
what would acress be corrected to?
- a unigram language model may correct it to across, but then the sentence makes little sense; actress would be better
- a bigram model does better because it takes the neighbouring words into account:
𝐏(actress|versatile) = 0.000021
𝐏(whose|actress) = 0.0010
𝐏(across|versatile) = 0.000021
𝐏(whose|across) = 0.000006
𝐏(‘versatile actress whose’) = 0.000021 × 0.0010 = 210 × 10⁻¹⁰
𝐏(‘versatile across whose’) = 0.000021 × 0.000006 = 1.26 × 10⁻¹⁰

The bigram model therefore prefers actress.
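
The two products can be checked with a short sketch. The bigram probabilities are the four values quoted above; the dictionary layout and the `sequence_prob` helper are assumptions for illustration. Like the calculation above, it multiplies only the two bigrams around the candidate and leaves out the probability of versatile itself.

```python
# Bigram probabilities P(current | previous), keyed as (previous, current).
bigram = {
    ("versatile", "actress"): 0.000021,
    ("actress", "whose"): 0.0010,
    ("versatile", "across"): 0.000021,
    ("across", "whose"): 0.000006,
}

def sequence_prob(words):
    """Product of the bigram probabilities of each adjacent word pair."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram[(prev, cur)]
    return p

p_actress = sequence_prob(["versatile", "actress", "whose"])
p_across = sequence_prob(["versatile", "across", "whose"])
print(p_actress, p_across, p_actress > p_across)
```

In context, actress comes out well over a hundred times more probable than across, which reverses the unigram ranking.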

／var／log marcus chiu
