Real-Word Spelling Errors

for example:

  • leaving in about fifteen minuets
  • the design an construction of the system
  • can they lave him my messages?
  • the study was conducted mainly be John Black

25-40% of spelling errors result in real words

Solving Real-Word Spelling Errors - Processes

for each word in sentence:

  • generate candidate set
    • the word itself
    • all single-letter edits that are English words
    • words that are homophones
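
The candidate-generation step can be sketched as follows. This is a minimal illustration, not a reference implementation: the vocabulary `VOCAB` and the `HOMOPHONES` table are tiny hypothetical stand-ins for a real dictionary and pronunciation resource, and "single-letter edit" is assumed here to mean one deletion, transposition, substitution, or insertion.

```python
import string

# Hypothetical toy resources, invented for illustration only.
VOCAB = {"two", "too", "to", "of", "on", "off", "the", "thew", "threw", "thaw", "them"}
HOMOPHONES = {"two": {"too", "to"}, "too": {"two", "to"}, "to": {"two", "too"}}


def single_edits(word):
    """All strings one deletion, transposition, substitution, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, vocab=VOCAB, homophones=HOMOPHONES):
    """Candidate set: the word itself, in-vocabulary single edits, and homophones."""
    cands = {word}                         # the word itself (it may be correct)
    cands |= single_edits(word) & vocab    # single-letter edits that are real words
    cands |= homophones.get(word, set())   # words that sound the same
    return cands
```

With this toy vocabulary, `candidates("thew")` contains `the`, `thaw`, `threw`, and `them` alongside `thew` itself. Keeping the original word in the set matters because most words in a sentence are not errors.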

choose best candidates

  • noisy channel model
  • task-specific classifier

given a sentence [𝑤1, …, 𝑤𝑘] generate a set of candidates for each word 𝑤𝑖:

  • candidate(𝑤1) = [𝑤1, 𝑤1’, 𝑤1”, …]
  • candidate(𝑤𝑘) = [𝑤𝑘, 𝑤𝑘’, 𝑤𝑘”, …]

choose sequence 𝑊 that maximizes 𝐏(𝑊)

Example Noisy Channel For Real-Word Spell Correction

Simplification: One Error Word Per Sentence
  • out of all possible sentences with one word replaced
    • 𝑤1, 𝑤2”, 𝑤3, … two on thew
    • 𝑤1, 𝑤2, 𝑤3’, … two of threw
    • 𝑤1’’’, 𝑤2, 𝑤3, … too of thew
  • choose sequence 𝑊 that maximizes 𝐏(𝑊)
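
The one-error-per-sentence search can be sketched as below. The candidate sets in `TOY_CANDIDATES` and the scores in `TOY_BIGRAM_LOGP` are invented for illustration, and `log_p` is only a toy stand-in for log 𝐏(𝑊); a fuller version would presumably also add the log of the channel probability for the replaced word.

```python
# Hypothetical toy data: candidate sets and bigram log-probabilities are invented
# for illustration, not estimated from a corpus.
TOY_CANDIDATES = {
    "two": ["two", "too", "to"],
    "of": ["of", "on", "off"],
    "thew": ["thew", "the", "threw", "thaw"],
}
TOY_BIGRAM_LOGP = {("two", "of"): -2.0, ("of", "the"): -1.0, ("too", "of"): -6.0}
UNSEEN_LOGP = -10.0  # flat penalty for any bigram missing from the toy table


def one_replacement_sentences(words, candidates):
    """Yield the original sentence, then every variant with exactly one word replaced."""
    yield list(words)
    for i, word in enumerate(words):
        for cand in candidates.get(word, [word]):
            if cand != word:
                yield list(words[:i]) + [cand] + list(words[i + 1:])


def log_p(sentence):
    """Toy stand-in for log P(W): sum of bigram log-probabilities."""
    pairs = zip(sentence, sentence[1:])
    return sum(TOY_BIGRAM_LOGP.get(pair, UNSEEN_LOGP) for pair in pairs)


def best_sentence(words, candidates=TOY_CANDIDATES, score=log_p):
    """Pick the highest-scoring sentence among the one-replacement variants."""
    return max(one_replacement_sentences(words, candidates), key=score)
```

For the sentence `["two", "of", "thew"]` this enumerates only 8 sentences (the original plus 7 single replacements) instead of all 3 × 3 × 4 = 36 combinations, which is the point of the simplification.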
Where to Get Probabilities
  • language model (e.g. unigram, bigram, n-gram) optionally with stupid backoff
  • channel/error model
    • same as for non-word spelling correction
    • plus additional probability for no error 𝐏(𝑤|𝑤)
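
A rough sketch of both probability sources follows. The function names, the tiny training text, and the `edit_prob` callable are assumptions made for illustration. The backoff factor 0.4 is the value commonly used with stupid backoff, and 0.95 is taken from the no-error entry in the table below. Note that stupid backoff produces scores rather than normalized probabilities.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor commonly used with stupid backoff


def train(tokens):
    """Count unigrams and bigrams from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def stupid_backoff(prev, word, unigrams, bigrams, alpha=ALPHA):
    """Score S(word | prev): relative bigram frequency if the bigram was seen,
    otherwise the unigram relative frequency discounted by alpha."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total


def channel_prob(observed, candidate, edit_prob, p_no_error=0.95):
    """Channel model with an explicit no-error case.

    edit_prob is a hypothetical callable giving P(observed | candidate) for a
    genuine edit, as in non-word spelling correction.
    """
    if observed == candidate:
        return p_no_error  # the typed word was what the writer intended
    return edit_prob(observed, candidate)
```

The value of `p_no_error` is a tuning choice: a higher value makes the corrector more reluctant to change a word that is already in the dictionary.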

Peter Norvig’s “thew” Example

noisy  candidate    𝑛𝑜𝑖𝑠𝑦-𝑐ℎ𝑎𝑟(𝑠) |  𝐏(𝑛𝑜𝑖𝑠𝑦-𝑐ℎ𝑎𝑟(𝑠) |  𝐏(𝑠𝑜𝑢𝑟𝑐𝑒-𝑐ℎ𝑎𝑟(𝑠))  𝐏(𝑛𝑜𝑖𝑠𝑦-𝑐ℎ𝑎𝑟(𝑠) | 𝑠𝑜𝑢𝑟𝑐𝑒-𝑐ℎ𝑎𝑟(𝑠))
word   source word  𝑠𝑜𝑢𝑟𝑐𝑒-𝑐ℎ𝑎𝑟(𝑠)   𝑠𝑜𝑢𝑟𝑐𝑒-𝑐ℎ𝑎𝑟(𝑠))                       · 𝐏(𝑠𝑜𝑢𝑟𝑐𝑒-𝑐ℎ𝑎𝑟(𝑠))

thew   the          ew|e             0.000007           0.02               0.00000014
thew   thew         (none)           0.95               0.00000009         0.000000085
thew   thaw         e|a              0.001              0.0000007          0.000000001
thew   threw        h|hr             0.000008           0.000004           0.00000000003
thew   thwe         ew|we            0.000003           0.00000004         0.0000000000001
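
The ranking implied by the values listed above can be reproduced with a few lines. This sketch only multiplies the two probability columns for each candidate and sorts by the product; the names `ROWS` and `rank` are illustrative.

```python
# (candidate source word, channel probability, source probability),
# copied from the values listed above for the noisy word "thew".
ROWS = [
    ("the", 0.000007, 0.02),
    ("thew", 0.95, 0.00000009),
    ("thaw", 0.001, 0.0000007),
    ("threw", 0.000008, 0.000004),
    ("thwe", 0.000003, 0.00000004),
]


def rank(rows):
    """Return (candidate, channel * source) pairs, best candidate first."""
    scored = [(word, channel * source) for word, channel, source in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for word, score in rank(ROWS):
    print(f"{word:6s} {score:.3g}")
```

Under these numbers "the" comes out ahead of leaving "thew" unchanged, even though the no-error channel probability (0.95) is far larger than any edit probability, because "the" is so much more frequent as a source word.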