Many problems in language processing can be viewed as noisy channel problems
- Optical Character Recognition
- Spelling Correction
- Speech recognition
- Machine translation
The Noisy Channel Model of Spelling
π ππ’πππ-π€πππ β noisy-channel β ππππ π¦-π€πππ
find the most probable π ππ’πππ-π€πππ given the observed ππππ π¦-π€πππ:
- π ππ’πππ-π€πππΛ = ππππππ₯π ππ’πππ-π€πππ π(π ππ’πππ-π€πππ|ππππ π¦-π€πππ)
- π ππ’πππ-π€πππΛ =Β ππππππ₯π ππ’πππ-π€πππΒ π(ππππ π¦-π€πππ|π ππ’πππ-π€πππ)π(π ππ’πππ-π€πππ)/π(ππππ π¦-π€πππ) #Β Bayesβ Theorem
- π ππ’πππ-π€πππΛ =Β ππππππ₯π ππ’πππ-π€πππΒ π(ππππ π¦-π€πππ|π ππ’πππ-π€πππ)π(π ππ’πππ-π€πππ) #Β π(ππππ π¦-π€πππ) is a constant w.r.t.Β ππππππ₯π ππ’πππ-π€πππ
where:
- π(π ππ’πππ-π€πππ) - language model (prior probability)
- π(ππππ π¦-π€πππ|π ππ’πππ-π€πππ) - noisy channel model or error model (likelihood)
Language Model Probability
- unigram, bigram, trigram, n-gram
- web-scale spelling correction
- stupid backoff
Noisy Channel Model - Problems
It is fruitless to try to collect statistics about the misspellings of individual words given a large dictionary. Youβll likely never get enough data.
We need a way to compute π(ππππ π¦-π€πππ|π ππ’πππ-π€πππ)Β without using direct information.
This is where Edit Distance come in
Edit Distance
Damerau-Levenshtein Edit Distance is the minimal edit distance between 2 strings, where edits are:
- deletion
- thereΒ β ther
- insertion (also allow insertion of space or hyphen)
- the β ther
- substitution
- nowΒ β noq
- transposition of 2 adjacent characters
- theΒ β teh
Candidate Generation
- 80% of errors are within 1 edit distance
- ~100% of errors are within 2 edit distance
Learning
Collect statistics for each error type from a large corpus.
For example, asking for π(acress|actress) is assumed to be the same as asking for the probability that a deletion of t happened hereΒ π(c|ct)
So just collect a large corpus of text (containing errors)Β and see how often t gets deleted
Channel/Error Model Probability
π(π₯|π¦) = probability of edit
where:
- πππ[π₯,π¦]: count ofΒ π₯ in noisy training set when it should beΒ π₯π¦
- πππ [π₯,π¦]: count of π₯π¦ in noisy training set when it should beΒ π₯
- π π’π[π₯,π¦]: count ofΒ π¦ in noisy training set when it should beΒ π₯?????
- π‘ππ[π₯,π¦]: count ofΒ π¦π₯ in noisy training set when it should be π₯π¦
single edit:
- π(π₯|π₯π¦) = πππ[π₯,π¦] / πππ’ππ‘-π ππ’πππ[π₯,π¦]Β # if deletion
- π(π₯π¦|π₯) = πππ [π₯,π¦] / πππ’ππ‘-π ππ’πππ[π₯]Β # if insertion
- π(π₯|π¦) = π π’π[π₯,π¦] / πππ’ππ‘-π ππ’πππ[π¦]Β # if substitution
- π(π¦π₯|π₯π¦) = π‘ππ[π₯,π¦] / πππ’ππ‘-π ππ’πππ[π₯,π¦]Β # if transposition
let:
- π be correct π ππ’πππ-π€πππ whereΒ π =Β [π 1, β¦,Β π π]
- π beΒ misspelled ππππ π¦-π€πππ where π =Β [π 1, β¦,Β π π]
for single edit:
- π(π|π ) = πππ[π π-1,π π] / πππ’ππ‘[π π-1,π π] # if deletion
- π(π|π ) = πππ [π π-1,ππ] / πππ’ππ‘[π π-1] # if insertion
- π(π|π ) = π π’π[ππ,π π] / πππ’ππ‘[π π] # if substitution
- π(π|π ) = π‘ππ[π π,π π+1] / πππ’ππ‘[π π,π π+1] # if transposition
Noisy Channel Probability Model For βacressβ
- π = acress
|
Candidate Correction π =? |
Correct Letter |
Error Letter |
π|π |
π(π|π ) |
π(π ) |
π(π|π )π(π )*109 |
|---|---|---|---|---|---|---|
|
|
|
|
|
0.000117 |
0.0000231 |
2.7027 |
|
|
|
|
|
0.00000144 |
0.000000544 |
0.00078336 |
|
|
|
|
|
0.00000164 |
0.00000170 |
0.002788 |
|
|
|
|
|
0.00000209 |
0.0000916 |
0.191444 |
|
|
|
|
|
0.0000093 |
0.000299 |
2.7807 |
|
|
|
|
|
0.0000321 |
0.0000318 |
1.02078 |
|
|
|
|
|
0.0000342 |
0.0000318 |
1.08756 |
thus,Β acress would be corrected toΒ across
Using a Bigram Language Model
using a unigram language model is not as good as using a bigram model
for example:
- β
β¦ a versatile acress whose combination β¦β - what would acress be corrected to?
- a unigram language model may correct it to
acrossΒ however the sentence wonβt make much senseactressΒ would be better - a bigram model would be better
- a unigram language model may correct it to
- π(
actress|versatile) = 0.000021 - π(
whose|actress) = 0.0010 - π(
across|versatile) = 0.000021 - π(
whose|across) = 0.000006 - π(β
versatile actress whoseβ) = 0.000021 * 0.0010 = 210x10β»ΒΉβ° - π(β
versatile across whoseβ) = 0.000021 * 0.000006 = 1.26x10β»ΒΉβ°