Using Boolean Multinomial Naive Bayes for Text Classification

Training

calculate 𝐏(𝐶=𝑐ⱼ) prior probabilities:

for each 𝑐ⱼ in 𝐶:
	docsⱼ = all docs with class 𝑐ⱼ
	𝐏(𝐶=𝑐ⱼ) = |docsⱼ| / |total # documents|
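
The prior loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `class_priors` and the sample labels are assumptions made for this example.

```python
from collections import Counter

def class_priors(labels):
    """Return P(C = c_j) for each class as |docs_j| / |total # documents|."""
    counts = Counter(labels)   # |docs_j| for every class c_j
    total = len(labels)        # total number of training documents
    return {c: n / total for c, n in counts.items()}

# Hypothetical class labels for four training documents.
priors = class_priors(["pos", "pos", "neg", "pos"])
```

Here `priors` maps each class label to the fraction of training documents carrying that label.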

calculate 𝐏(𝑤ᵢ|𝑐ⱼ) likelihoods:

  • in each doc remove duplicates of each word type (i.e. retain only a single instance of each word)
  • 𝛼 = smoothing constant (𝛼 = 1 gives add-one smoothing)
for each 𝑐ⱼ in 𝐶:
	corpusⱼ = single doc containing all (deduplicated) docsⱼ
	𝑛 = size of corpusⱼ
	for each word 𝑤ᵢ in vocabulary:
		𝑛ᵢ = # of occurrences of 𝑤ᵢ in corpusⱼ
		𝐏(𝑤ᵢ|𝑐ⱼ) = (𝑛ᵢ + 𝛼) / (𝑛 + 𝛼|vocabulary|)
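
A possible Python sketch of the likelihood step, assuming each document is a list of word tokens and that counts are collected per class. The function name `boolean_likelihoods`, the default `alpha=1.0`, and the toy documents are assumptions for illustration only.

```python
from collections import Counter

def boolean_likelihoods(docs, labels, alpha=1.0):
    """Estimate P(w_i | c_j) with per-document deduplication and add-alpha smoothing."""
    # Vocabulary: every word type seen in the training documents.
    vocab = {w for doc in docs for w in doc}
    counts = {c: Counter() for c in set(labels)}
    for doc, c in zip(docs, labels):
        # set(doc) removes duplicates, so each word counts at most once per document.
        counts[c].update(set(doc))
    likelihoods = {}
    for c, counter in counts.items():
        n = sum(counter.values())        # size of the deduplicated class corpus
        denom = n + alpha * len(vocab)
        likelihoods[c] = {w: (counter[w] + alpha) / denom for w in vocab}
    return likelihoods

# Hypothetical toy training set.
docs = [["good", "good", "film"], ["bad", "film", "bad"], ["good", "plot"]]
labels = ["pos", "neg", "pos"]
lik = boolean_likelihoods(docs, labels)
```

With smoothing, every vocabulary word receives a non-zero probability in every class, and the probabilities within one class sum to 1.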

Testing

on testing document 𝑑 = [𝑤₁, …, 𝑤ᵣ]:

  • remove duplicates of each word type from 𝑑, as in training

argmax_{𝑐ⱼ∊𝐶} [ 𝐏(𝑐ⱼ) · 𝛱_{𝑖∊positions} 𝐏(𝑤ᵢ|𝑐ⱼ) ]
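
The decision rule can be sketched in Python as below. Two choices here are assumptions rather than part of the notes: the product is computed as a sum of logarithms to avoid numerical underflow (this does not change the argmax), and words missing from the vocabulary are skipped. The function name `classify` and the probability tables are hypothetical.

```python
import math

def classify(doc, priors, likelihoods):
    """Return the class maximizing log P(c) + sum of log P(w | c) over distinct words."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in set(doc):               # boolean variant: duplicates removed
            if w in likelihoods[c]:      # assumption: ignore out-of-vocabulary words
                score += math.log(likelihoods[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical, already-estimated probabilities.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"good": 0.6, "bad": 0.1, "film": 0.3},
    "neg": {"good": 0.1, "bad": 0.6, "film": 0.3},
}
prediction = classify(["good", "good", "film", "unseen"], priors, likelihoods)
```

Because the test document is deduplicated, the repeated "good" contributes to the score only once.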

Example - Normal Naive Bayes vs Boolean Multinomial Naive Bayes

Normal Naive Bayes

Boolean Multinomial Naive Bayes
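
As a small illustration of what the boolean variant changes, the sketch below compares the word counts that each variant would extract from one document. The document is made up for this example and is not taken from the notes.

```python
from collections import Counter

# Hypothetical document with a repeated word.
doc = ["great", "great", "great", "boring", "plot"]

normal_counts = Counter(doc)         # normal multinomial: raw term frequencies
boolean_counts = Counter(set(doc))   # boolean multinomial: every count clipped to 1
```

In `normal_counts` the word "great" has count 3, while in `boolean_counts` every word present has count 1, so a heavily repeated word carries no more weight than a word that appears once.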