N-Gram Smoothing Techniques

Description

Add-One Smoothing

𝑃(𝑤2|𝑤1) = [𝐶(𝑤1𝑤2)+1] / [𝐶(𝑤1)+𝑉]

where:

  • 𝐶(𝑤1𝑤2) - count of 𝑤1𝑤2 occurring in the corpus
  • 𝐶(𝑤1) - count of 𝑤1 occurring in the corpus
  • 𝑉 - vocabulary size
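
The add-one estimate above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `add_one_bigram_prob` and the toy corpus are assumptions made for the example, and 𝑉 is taken to be the number of distinct tokens in the corpus.

```python
from collections import Counter

def add_one_bigram_prob(tokens, w1, w2):
    """Add-one estimate of P(w2 | w1) from a list of tokens (hypothetical helper)."""
    unigram_counts = Counter(tokens)                   # C(w1)
    bigram_counts = Counter(zip(tokens, tokens[1:]))   # C(w1 w2)
    vocab_size = len(unigram_counts)                   # V, assumed = distinct tokens
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

tokens = "the cat sat on the mat".split()
seen = add_one_bigram_prob(tokens, "the", "cat")    # (1 + 1) / (2 + 5)
unseen = add_one_bigram_prob(tokens, "the", "sat")  # (0 + 1) / (2 + 5)
print(seen, unseen)
```

The unseen bigram "the sat" receives a non-zero probability, which is the point of the smoothing.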

Add-π‘Ž Smoothing

𝐏(𝑀2|𝑀1) = [𝐢(𝑀1𝑀2)+π‘Ž] / [𝐢(𝑀1)+π‘Žπ‘‰]

Good-Turing Discounting

𝑃(𝑤2|𝑤1) = 𝑃*(𝑤1𝑤2) / 𝑃*(𝑤1)

where:

  • 𝑃*(𝑤1𝑤2) = ((𝑐12+1)·𝑁_{𝑐12+1}) / (𝑁_{𝑐12}·𝑁)
  • 𝑃*(𝑤1) = ((𝑐1+1)·𝑁_{𝑐1+1}) / (𝑁_{𝑐1}·𝑁)

where:

  • 𝑐12 = 𝐶(𝑤1𝑤2) - count of 𝑤1𝑤2 occurring in the corpus
  • 𝑐1 = 𝐶(𝑤1) - count of 𝑤1 occurring in the corpus
  • 𝑁_𝑐 = the number of N-Grams that occur 𝑐 times
  • 𝑁 = total number of N-Grams
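
The discounted probabilities above can be illustrated with a short Python sketch. It applies the formula literally to raw counts, which is an assumption of this example: the helper name `good_turing_prob` and the toy corpus are hypothetical, unseen N-Grams are rejected because the count of N-Grams seen zero times cannot be read off the observed counts, and the estimate drops to zero whenever no N-Gram occurs exactly one more time than the one queried. Practical implementations typically smooth the counts-of-counts first.

```python
from collections import Counter

def good_turing_prob(ngram_counts, ngram):
    """P*(ngram) = ((c + 1) * N_{c+1}) / (N_c * N) for an N-Gram seen c > 0 times."""
    count_of_counts = Counter(ngram_counts.values())  # N_c for each observed c
    total = sum(ngram_counts.values())                # N
    c = ngram_counts.get(ngram, 0)
    if c == 0:
        raise ValueError("unseen N-Gram: N_0 is not available from observed counts")
    # Counter returns 0 for a missing c + 1, so the estimate can be 0.
    return (c + 1) * count_of_counts[c + 1] / (count_of_counts[c] * total)

tokens = "a b a b a c a b d".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

p_bigram = good_turing_prob(bigram_counts, ("b", "a"))  # c = 2: 3 * 1 / (1 * 8)
p_unigram = good_turing_prob(unigram_counts, "b")       # c = 3: 4 * 1 / (1 * 9)
p_conditional = p_bigram / p_unigram                    # P(a | b)
print(p_bigram, p_unigram, p_conditional)
```

In this toy corpus the bigram "a b" occurs three times and no bigram occurs four times, so its literal estimate is zero, which shows why the raw formula is rarely used without smoothing the counts-of-counts.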

Backoff