Many problems in language processing can be viewed as noisy channel problems

  • Optical Character Recognition
  • Spelling Correction
  • Speech recognition
  • Machine translation

The Noisy Channel Model of Spelling

π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘ β†’ noisy-channel β†’ π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘

find the most probable π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘ given the observed π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘:

  • π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘Λ† = π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘ 𝐏(π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘|π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘)
  • π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘Λ† =Β π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘Β π(π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘|π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘)𝐏(π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘)/𝐏(π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘) #Β Bayes’ Theorem
  • π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘Λ† =Β π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘Β π(π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘|π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘)𝐏(π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘) # 𝐏(π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘) is a constant w.r.t.Β π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘

where:

  • 𝐏(π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘) - language model (prior probability)
  • 𝐏(π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘|π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘) - noisy channel model or error model (likelihood)

Language Model Probability

  • unigram, bigram, trigram, n-gram
  • web-scale spelling correction
    • stupid backoff

Noisy Channel Model - Problems

It is fruitless to try to collect statistics about the misspellings of individual words given a large dictionary. You’ll likely never get enough data.

We need a way to compute 𝐏(π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘|π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘)Β without using direct information.

This is where Edit Distance come in

Edit Distance

Damerau-Levenshtein Edit Distance is the minimal edit distance between 2 strings, where edits are:

  • deletion
    • thereΒ β†’ ther
  • insertion (also allow insertion of space or hyphen)
    • the β†’ ther
  • substitution
    • nowΒ β†’ noq
  • transposition of 2 adjacent characters
    • theΒ β†’ teh

Candidate Generation

  • 80% of errors are within 1 edit distance
  • ~100% of errors are within 2 edit distance

Learning

Collect statistics for each error type from a large corpus.

For example, asking for 𝐏(acress|actress) is assumed to be the same as asking for the probability that a deletion of t happened here 𝐏(c|ct)

So just collect a large corpus of text (containing errors)Β and see how often t gets deleted

Channel/Error Model Probability

𝐏(π‘₯|𝑦) = probability of edit

where:

  • 𝑑𝑒𝑙[π‘₯,𝑦]: count ofΒ π‘₯ in noisy training set when it should beΒ π‘₯𝑦
  • 𝑖𝑛𝑠[π‘₯,𝑦]: count of π‘₯𝑦 in noisy training set when it should beΒ π‘₯
  • 𝑠𝑒𝑏[π‘₯,𝑦]: count of 𝑦 in noisy training set when it should beΒ π‘₯?????
  • π‘‘π‘Ÿπ‘Ž[π‘₯,𝑦]: count of 𝑦π‘₯ in noisy training set when it should be π‘₯𝑦

single edit:

  • 𝐏(π‘₯|π‘₯𝑦) = 𝑑𝑒𝑙[π‘₯,𝑦] / π‘π‘œπ‘’π‘›π‘‘-π‘ π‘œπ‘’π‘Ÿπ‘π‘’[π‘₯,𝑦]Β # if deletion
  • 𝐏(π‘₯𝑦|π‘₯) = 𝑖𝑛𝑠[π‘₯,𝑦] / π‘π‘œπ‘’π‘›π‘‘-π‘ π‘œπ‘’π‘Ÿπ‘π‘’[π‘₯]Β # if insertion
  • 𝐏(π‘₯|𝑦) = 𝑠𝑒𝑏[π‘₯,𝑦] / π‘π‘œπ‘’π‘›π‘‘-π‘ π‘œπ‘’π‘Ÿπ‘π‘’[𝑦]Β # if substitution
  • 𝐏(𝑦π‘₯|π‘₯𝑦) = π‘‘π‘Ÿπ‘Ž[π‘₯,𝑦] / π‘π‘œπ‘’π‘›π‘‘-π‘ π‘œπ‘’π‘Ÿπ‘π‘’[π‘₯,𝑦]Β # if transposition

let:

  • 𝑠 be correct π‘ π‘œπ‘’π‘Ÿπ‘π‘’-π‘€π‘œπ‘Ÿπ‘‘ where 𝑠 =Β [𝑠1, …, 𝑠𝑗]
  • 𝑛 beΒ misspelled π‘›π‘œπ‘–π‘ π‘¦-π‘€π‘œπ‘Ÿπ‘‘ where 𝑛 =Β [𝑠1, …,Β π‘ π‘˜]

for single edit:

  • 𝐏(𝑛|𝑠) = 𝑑𝑒𝑙[𝑠𝑖-1,𝑠𝑖] / π‘π‘œπ‘’π‘›π‘‘[𝑠𝑖-1,𝑠𝑖] # if deletion
  • 𝐏(𝑛|𝑠) = 𝑖𝑛𝑠[𝑠𝑖-1,𝑛𝑖] / π‘π‘œπ‘’π‘›π‘‘[𝑠𝑖-1] # if insertion
  • 𝐏(𝑛|𝑠) = 𝑠𝑒𝑏[𝑛𝑖,𝑠𝑖] / π‘π‘œπ‘’π‘›π‘‘[𝑠𝑖] # if substitution
  • 𝐏(𝑛|𝑠) = π‘‘π‘Ÿπ‘Ž[𝑠𝑖,𝑠𝑖+1] / π‘π‘œπ‘’π‘›π‘‘[𝑠𝑖,𝑠𝑖+1] # if transposition

Noisy Channel Probability Model For β€œacress”

  • 𝑛 = acress

Candidate Correction 𝑠=?

Correct Letter

Error Letter

𝑛|𝑠

𝐏(𝑛|𝑠)

𝐏(𝑠)

𝐏(𝑛|𝑠)𝐏(𝑠)*109

actress

t

-

c|ct

0.000117

0.0000231

2.7027

cress

-

a

a|^

0.00000144

0.000000544

0.00078336

caress

ca

ac

ac|ca

0.00000164

0.00000170

0.002788

access

c

r

r|c

0.00000209

0.0000916

0.191444

across

o

e

e|o

0.0000093

0.000299

2.7807

acres

-

s

es|e

0.0000321

0.0000318

1.02078

acres

-

s

ss|s

0.0000342

0.0000318

1.08756

thus,Β acress would be corrected toΒ across

Using a Bigram Language Model

using a unigram language model is not as good as using a bigram model

for example:

  • ”… a versatile acress whose combination …”
  • what would acress be corrected to?
    • a unigram language model may correct it to acrossΒ however the sentence won’t make much sense actressΒ would be better
    • a bigram model would be better
  • 𝐏(actress|versatile) = 0.000021
  • 𝐏(whose|actress) = 0.0010
  • 𝐏(across|versatile) = 0.000021
  • 𝐏(whose|across) = 0.000006
  • 𝐏(β€˜versatile actress whose’) = 0.000021 * 0.0010 = 210x10⁻¹⁰
  • 𝐏(β€˜versatile across whose’) = 0.000021 * 0.000006 = 1.26x10⁻¹⁰