Real-Word Spelling Errors
for example:
- leaving in about fifteen minuets
- the design an construction of the system
- can they lave him my messages?
- the study was conducted mainly be John Black
25-40% of spelling errors are real words
Solving Real-Word Spelling Errors - Processes
for each word in sentence:
- generate candidate set
  - the word itself
  - all single-letter edits that are English words
  - words that are homophones
choose best candidates
- noisy channel model
- task-specific classifier
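
A minimal sketch of the candidate-generation step above. The vocabulary `VOCAB` and the homophone table `HOMOPHONES` are tiny hypothetical stand-ins; a real system would load a full dictionary and a pronunciation-based homophone list.

```python
import string

# Toy vocabulary (assumption for illustration only).
VOCAB = {"the", "thew", "thaw", "threw", "thwe",
         "two", "too", "to", "of", "off", "on"}

# Toy homophone table (hypothetical).
HOMOPHONES = {"two": {"too", "to"}, "too": {"two", "to"}, "to": {"two", "too"}}

def single_edits(word):
    """All strings one deletion, transposition, substitution, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def candidates(word):
    """The word itself, its single-letter edits that are real words, and its homophones."""
    real_word_edits = {w for w in single_edits(word) if w in VOCAB}
    return {word} | real_word_edits | HOMOPHONES.get(word, set())

print(sorted(candidates("thew")))  # → ['thaw', 'the', 'thew', 'threw', 'thwe']
```

The word itself is always kept as a candidate, since most words in a sentence are not errors.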
given a sentence [𝑤1, …, 𝑤𝑘] generate a set of candidates for each word 𝑤𝑖:
- candidate(𝑤1) = [𝑤1, 𝑤1’, 𝑤1”, …]
- …
- candidate(𝑤𝑘) = [𝑤𝑘, 𝑤𝑘’, 𝑤𝑘”, …]
choose sequence 𝑊 that maximizes 𝐏(𝑊)
Example Noisy Channel For Real-Word Spell Correction
[Figure: noisy channel example for real-word spelling correction (real-word-spelling-error.png)]
Simplification: One Error Word Per Sentence
- out of all possible sentences with one word replaced (e.g. for the input "two of thew")
  - 𝑤1, 𝑤2”, 𝑤3, … (two on thew)
  - 𝑤1, 𝑤2, 𝑤3’, … (two of threw)
  - 𝑤1’’’, 𝑤2, 𝑤3, … (too of thew)
- choose sequence 𝑊 that maximizes 𝐏(𝑊)
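
A minimal sketch of the one-error-per-sentence search, assuming candidates are supplied as a dictionary. `CANDS`, `UNIGRAM`, and `toy_score` are hypothetical toy values; the score here is only a unigram language model, whereas a fuller system would use a bigram or higher-order model and multiply in the channel probability.

```python
import math

def one_error_variants(sentence, cands):
    """Yield the original sentence, then every sentence with exactly one word replaced."""
    yield list(sentence)
    for i, word in enumerate(sentence):
        for cand in cands.get(word, ()):
            if cand != word:
                yield sentence[:i] + [cand] + sentence[i + 1:]

def best_sentence(sentence, cands, score):
    """Pick the variant W with the highest score, standing in for P(W)."""
    return max(one_error_variants(sentence, cands), key=score)

# Hypothetical candidate sets and unigram probabilities, for illustration only.
CANDS = {"two": ["too", "to"], "of": ["off", "on"], "thew": ["the", "threw", "thaw"]}
UNIGRAM = {"the": 0.02, "of": 0.01, "to": 0.015, "on": 0.005, "two": 0.001,
           "too": 0.0005, "off": 0.0005, "threw": 4e-6, "thaw": 7e-7, "thew": 9e-8}

def toy_score(words):
    """Log-probability of the sentence under the toy unigram model."""
    return sum(math.log(UNIGRAM[w]) for w in words)

print(best_sentence(["two", "of", "thew"], CANDS, toy_score))  # → ['two', 'of', 'the']
```

Restricting the search to one replaced word keeps the number of scored sentences linear in sentence length (here 8) instead of the product of all candidate-set sizes.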
Where to Get Probabilities
- language model (e.g. unigram, bigram, n-gram) optionally with stupid backoff
- channel/error model
  - same as for non-word spelling correction
  - plus additional probability for no error 𝐏(𝑤|𝑤)
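
A rough sketch of both probability sources, under stated assumptions. The language model is a bigram model with stupid backoff (the backoff weight 0.4 is the commonly used default, and the result is a score rather than a normalized probability). The channel function takes a hypothetical `edit_prob` lookup for real edits and a `p_no_error` parameter for 𝐏(𝑤|𝑤); the 0.95 default is only an illustrative choice.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def stupid_backoff(prev, word, unigrams, bigrams, alpha=0.4):
    """Score of `word` following `prev`; backs off to a discounted unigram estimate."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / sum(unigrams.values())

def channel(noisy, source, edit_prob, p_no_error=0.95):
    """P(noisy | source): p_no_error if the word is unchanged, else a looked-up edit probability."""
    if noisy == source:
        return p_no_error
    return edit_prob.get((noisy, source), 0.0)

unigrams, bigrams = train("the cat sat on the mat".split())
print(stupid_backoff("the", "cat", unigrams, bigrams))  # seen bigram: 1/2
print(stupid_backoff("cat", "mat", unigrams, bigrams))  # unseen bigram: 0.4 * 1/6
```

The value chosen for the no-error probability depends on the application: a higher value makes the corrector more reluctant to change words that are already in the dictionary.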
Peter Norvig’s “thew” Example
| noisy word 𝑥 | candidate source word 𝑤 | 𝑛𝑜𝑖𝑠𝑦-𝑐ℎ𝑎𝑟(𝑠) \| 𝑠𝑜𝑢𝑟𝑐𝑒-𝑐ℎ𝑎𝑟(𝑠) | channel 𝐏(𝑥 \| 𝑤) | prior 𝐏(𝑤) | 𝐏(𝑥 \| 𝑤) · 𝐏(𝑤) |
|---|---|---|---|---|---|
| thew | the | ew \| e | 0.000007 | 0.02 | 0.00000014 |
| thew | thew | | 0.95 | 0.00000009 | 0.000000085 |
| thew | thaw | e \| a | 0.001 | 0.0000007 | 0.000000001 |
| thew | threw | h \| hr | 0.000008 | 0.000004 | 0.00000000003 |
| thew | thwe | ew \| we | 0.000003 | 0.00000004 | 0.0000000000001 |
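
The last column of the table is the product of the two probability columns before it, and the correction is the candidate with the largest product. A short sketch that recomputes the products from the table's values and ranks the candidates (the variable names are illustrative only):

```python
# (candidate source word, channel probability, prior probability), copied from the table.
rows = [
    ("the",   0.000007, 0.02),
    ("thew",  0.95,     0.00000009),
    ("thaw",  0.001,    0.0000007),
    ("threw", 0.000008, 0.000004),
    ("thwe",  0.000003, 0.00000004),
]

# Noisy channel score: channel probability times prior.
scores = {word: channel * prior for word, channel, prior in rows}
ranking = sorted(scores, key=scores.get, reverse=True)

for word in ranking:
    print(f"{word:6s} {scores[word]:.2e}")
print("best:", ranking[0])  # → best: the
```

With these numbers "the" narrowly beats leaving "thew" unchanged, because the high prior of "the" outweighs the low probability of the ew|e edit.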