Using Boolean Multinomial Naive Bayes for Text Classification
Training
calculate 𝐏(𝐶=𝑐ⱼ) prior probabilities:
    for each 𝑐ⱼ in 𝐶:
        docsⱼ = all docs with class 𝑐ⱼ
        𝐏(𝐶=𝑐ⱼ) = |docsⱼ| / |total # of documents|
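The prior step can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `class_priors` and the toy labels are assumptions made for the example.

```python
from collections import Counter

def class_priors(labels):
    """Return P(C = c_j) = |docs_j| / |total # of documents| for each class."""
    counts = Counter(labels)   # |docs_j| for every class c_j
    total = len(labels)        # total number of training documents
    return {c: n / total for c, n in counts.items()}

# Hypothetical toy data: one class label per training document.
labels = ["pos", "neg", "pos", "pos"]
priors = class_priors(labels)
```

Only the class labels are needed here; the document contents do not affect the priors.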
calculate 𝐏(𝑤ᵢ|𝑐ⱼ) likelihoods:
- in each doc, remove duplicates of each word type (i.e. retain only a single instance of each word)
for each 𝑐ⱼ in 𝐶:
    corpusⱼ = single doc containing all (deduplicated) docsⱼ
    𝑛 = total # of words in corpusⱼ
    for each word 𝑤ᵢ in vocabulary:
        𝑛ᵢ = # of occurrences of 𝑤ᵢ in corpusⱼ
        𝐏(𝑤ᵢ|𝑐ⱼ) = (𝑛ᵢ + 𝛼) / (𝑛 + 𝛼|vocabulary|)
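A minimal Python sketch of the likelihood step, assuming the counts 𝑛 and 𝑛ᵢ are taken separately for each class 𝑐ⱼ (otherwise 𝐏(𝑤ᵢ|𝑐ⱼ) would not depend on the class). The function name `boolean_likelihoods`, the default 𝛼 = 1 and the toy documents are assumptions for illustration.

```python
from collections import Counter

def boolean_likelihoods(docs, labels, alpha=1.0):
    """Estimate P(w_i | c_j) from counts where each doc contributes a word at most once."""
    vocabulary = sorted({w for doc in docs for w in doc})
    likelihoods = {}
    for c in set(labels):
        counts = Counter()
        for doc, label in zip(docs, labels):
            if label == c:
                counts.update(set(doc))  # set() keeps one instance per word type
        n = sum(counts.values())         # size of the deduplicated class corpus
        denom = n + alpha * len(vocabulary)
        likelihoods[c] = {w: (counts[w] + alpha) / denom for w in vocabulary}
    return likelihoods

# Hypothetical toy data: tokenized documents and their classes.
docs = [["good", "good", "fun"], ["bad", "bad", "boring"], ["good", "plot"]]
labels = ["pos", "neg", "pos"]
lik = boolean_likelihoods(docs, labels)
```

With this smoothing, the likelihoods of each class sum to 1 over the vocabulary, and words never seen in a class still receive a non-zero probability.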
Testing
on testing document 𝑑 = [𝑤₁, …, 𝑤ᵣ]:
    - remove duplicate words from 𝑑 (as in training)
    - predict ĉ = argmax over 𝑐ⱼ ∊ 𝐶 of [ 𝐏(𝑐ⱼ) · 𝛱 over 𝑖 ∊ positions of 𝐏(𝑤ᵢ|𝑐ⱼ) ]
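The decision rule can be sketched in Python as below. Several choices here are assumptions rather than part of the notes: the test document is deduplicated before scoring, scores are summed in log space to avoid underflow (which does not change the argmax), out-of-vocabulary words are skipped, and the probabilities are made-up toy values.

```python
import math

def classify(doc, priors, likelihoods):
    """Return the class maximizing P(c_j) * product of P(w_i | c_j) over distinct words."""
    words = set(doc)  # binarize the test document
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w in words:
            if w in likelihoods[c]:  # assumption: ignore out-of-vocabulary words
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical, hand-picked probabilities for illustration only.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"good": 0.6, "bad": 0.4},
    "neg": {"good": 0.2, "bad": 0.8},
}
prediction = classify(["good", "good", "bad"], priors, likelihoods)
```

Because of the deduplication, the repeated "good" in the test document contributes to the score only once.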
Example - Normal Naive Bayes vs Boolean Multinomial Naive Bayes
| Normal Naive Bayes | Boolean Multinomial Naive Bayes |
|---|---|