Histogram vs KDE - Estimating Probability Density Functions

Histogram	Kernel Density Estimation (KDE)
To estimate a univariate probability density 𝐏(𝑋=𝑥)
Given a set of 𝑛 samples 𝐷={𝑥₁, 𝑥₂, …, 𝑥_𝑛} drawn i.i.d. from the distribution of a random variable 𝑋
Histogram:
  𝐏ˆ_ℎ(𝑋=𝑥) = 1/[ℎ·𝑛]・𝛴_{1≤𝑖≤𝑛}𝛴_{𝑏𝑖𝑛∊𝐵𝐼𝑁𝑆}[𝐼(𝑥_𝑖∊𝑏𝑖𝑛)·𝐼(𝑥∊𝑏𝑖𝑛)]
  or, equivalently:
  𝐏ˆ_ℎ(𝑋=𝑥) = 1/[ℎ·𝑛]・𝑐𝑜𝑢𝑛𝑡(𝐷=𝑥)
  where:
    ℎ > 0 - bin-width
    𝑛 - total number of observed samples
    𝐵𝐼𝑁𝑆 - set of all bins (disjoint intervals of width ℎ)
    𝐼() - indicator function, evaluates to 1 when true, 0 when false
    𝑐𝑜𝑢𝑛𝑡(𝐷=𝑥) - number of observed samples that fall in the same bin as 𝑥
KDE:
  𝐏ˆ_ℎ(𝑋=𝑥) = 1/[ℎ·𝑛]・𝛴_{1≤𝑖≤𝑛}𝑘_ℎ(𝑥_𝑖,𝑥)
  where:
    ℎ > 0 - band-width
    𝑘_ℎ(𝑥_𝑖,𝑥) - a univariate kernel function, scaled so that the estimate integrates to 1:
      1/[ℎ·𝑛]・∫𝛴_{1≤𝑖≤𝑛}𝑘_ℎ(𝑥_𝑖,𝑥)𝑑𝑥 = 1
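The two univariate estimators can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the `origin` parameter (where the first bin edge sits), the half-open bins [a, a+ℎ), the choice of a standard normal kernel with 𝑘_ℎ(𝑥_𝑖,𝑥) = K((𝑥-𝑥_𝑖)/ℎ), and the sample values are all assumptions made for the example.

```python
import math

def histogram_density(samples, x, h, origin=0.0):
    """Histogram estimate at x: (number of samples in x's bin) / (h * n)."""
    def bin_index(v):
        # Bins are half-open intervals [origin + j*h, origin + (j+1)*h).
        return math.floor((v - origin) / h)
    count = sum(1 for xi in samples if bin_index(xi) == bin_index(x))
    return count / (h * len(samples))

def gaussian_kernel(u):
    """Standard normal density; integrates to 1 over the real line."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_density(samples, x, h):
    """Kernel estimate at x: sum of K((x - x_i) / h), divided by h * n."""
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (h * len(samples))

samples = [0.2, 0.7, 1.1, 2.5]  # made-up data, for illustration only
print(histogram_density(samples, 0.5, h=1.0))
print(kde_density(samples, 0.5, h=0.5))
```

Because the kernel itself integrates to 1, dividing the sum by ℎ·𝑛 makes the whole estimate integrate to 1, which is the normalisation condition stated for the KDE column.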
To estimate a joint probability density 𝐏(𝑋=𝑥,𝑍=𝑧)
Given a set of 𝑛 samples 𝐷={(𝑥₁,𝑧₁), (𝑥₂,𝑧₂), …, (𝑥_𝑛,𝑧_𝑛)} drawn i.i.d. from the joint distribution of 𝑋 and 𝑍
Histogram:
  𝐏ˆ_{ℎ𝑥·ℎ𝑧}(𝑋=𝑥,𝑍=𝑧) = 1/[ℎ_𝑥·ℎ_𝑧·𝑛]・𝛴_{1≤𝑖≤𝑛}𝛴_{𝑏𝑖𝑛∊𝐵𝐼𝑁𝑆}[𝐼((𝑥_𝑖,𝑧_𝑖)∊𝑏𝑖𝑛)·𝐼((𝑥,𝑧)∊𝑏𝑖𝑛)]
  where:
    ℎ_𝑥 > 0 - bin-width on 𝑥 axis
    ℎ_𝑧 > 0 - bin-width on 𝑧 axis
KDE:
  𝐏ˆ_ℎ(𝑋=𝑥,𝑍=𝑧) = 1/[ℎ²·𝑛]・𝛴_{1≤𝑖≤𝑛}𝑘_ℎ((𝑥_𝑖,𝑧_𝑖),(𝑥,𝑧))
  where:
    ℎ > 0 - band-width, shared by both axes
    𝑘_ℎ((𝑥_𝑖,𝑧_𝑖),(𝑥,𝑧)) - a bivariate kernel function
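A possible sketch of the bivariate case, under the same assumptions as before: the function names, the `origin` of the bin grid, and the half-open rectangular bins are illustrative choices, and the bivariate kernel is assumed to be a product of two standard normal densities with a single shared band-width ℎ.

```python
import math

def histogram2d_density(samples, x, z, hx, hz, origin=(0.0, 0.0)):
    """2-D histogram estimate: (samples in the cell of (x, z)) / (hx * hz * n)."""
    def cell(px, pz):
        # Each bin is a rectangle of size hx by hz, indexed by a pair of integers.
        return (math.floor((px - origin[0]) / hx),
                math.floor((pz - origin[1]) / hz))
    target = cell(x, z)
    count = sum(1 for xi, zi in samples if cell(xi, zi) == target)
    return count / (hx * hz * len(samples))

def kde2d_density(samples, x, z, h):
    """2-D kernel estimate with a product gaussian kernel and one band-width h."""
    total = 0.0
    for xi, zi in samples:
        ux, uz = (x - xi) / h, (z - zi) / h
        # Product of two standard normal densities; integrates to 1 over the plane.
        total += math.exp(-0.5 * (ux * ux + uz * uz)) / (2.0 * math.pi)
    return total / (h * h * len(samples))

pairs = [(0.1, 0.1), (0.4, 0.8), (1.5, 0.2), (2.2, 2.9)]  # made-up data
print(histogram2d_density(pairs, 0.5, 0.5, hx=1.0, hz=1.0))
print(kde2d_density(pairs, 0.5, 0.5, h=0.5))
```

The factor 1/ℎ² in the kernel estimate plays the same role as 1/[ℎ_𝑥·ℎ_𝑧] in the histogram: it divides by the area over which each observation's mass is spread.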
To estimate the conditional probability density 𝐏(𝑋=𝑥|𝑌=𝑦)
Histogram:
  𝐏ˆ(𝑋=𝑥|𝑌=𝑦) = 1/[ℎ·𝑐𝑜𝑢𝑛𝑡(𝑌=𝑦)]・𝑐𝑜𝑢𝑛𝑡(𝑋=𝑥,𝑌=𝑦)
  where:
    ℎ > 0 - bin-width
    𝑐𝑜𝑢𝑛𝑡(𝑌=𝑦) - number of observed samples with 𝑌=𝑦
    𝑐𝑜𝑢𝑛𝑡(𝑋=𝑥,𝑌=𝑦) - number of observed samples with 𝑌=𝑦 whose 𝑋 value falls in the same bin as 𝑥
KDE:
  𝐏ˆ(𝑋=𝑥|𝑌=𝑦) = 1/[ℎ·𝑐𝑜𝑢𝑛𝑡(𝑌=𝑦)]・𝛴_{(𝑥_𝑖,𝑦_𝑖)∊𝑎𝑙𝑙-𝑠𝑎𝑚𝑝𝑙𝑒𝑠}𝑘[(𝑥_𝑖,𝑦_𝑖),(𝑥,𝑦)]
  where:
    ℎ > 0 - band-width
    𝑐𝑜𝑢𝑛𝑡(𝑌=𝑦) - number of observed samples with 𝑌=𝑦
    𝑘[(𝑥_𝑖,𝑦_𝑖),(𝑥,𝑦)] - the kernel function
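The conditional estimators can be sketched as follows. This rests on assumptions that the formulas above leave open: 𝑌 is treated as a discrete label, and the kernel 𝑘[(𝑥_𝑖,𝑦_𝑖),(𝑥,𝑦)] is taken to be a standard normal kernel in 𝑥 multiplied by the indicator 𝐼(𝑦_𝑖=𝑦), so only samples with the requested label contribute. The function names and data are hypothetical.

```python
import math

def conditional_histogram(pairs, x, y, h, origin=0.0):
    """count(X in x's bin, Y=y) / (h * count(Y=y))."""
    def bin_index(v):
        return math.floor((v - origin) / h)
    in_class = [xi for xi, yi in pairs if yi == y]
    if not in_class:
        raise ValueError("no samples observed with this value of y")
    count_xy = sum(1 for xi in in_class if bin_index(xi) == bin_index(x))
    return count_xy / (h * len(in_class))

def conditional_kde(pairs, x, y, h):
    """Gaussian kernel estimate using only the samples whose label equals y."""
    in_class = [xi for xi, yi in pairs if yi == y]
    if not in_class:
        raise ValueError("no samples observed with this value of y")
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) / math.sqrt(2.0 * math.pi)
                for xi in in_class)
    return total / (h * len(in_class))

pairs = [(0.2, "a"), (0.6, "a"), (1.4, "a"), (0.3, "b")]  # made-up data
print(conditional_histogram(pairs, 0.5, "a", h=1.0))
print(conditional_kde(pairs, 0.5, "a", h=0.5))
```

Both functions are simply the univariate estimators applied to the subset of samples with 𝑌=𝑦, which is why 𝑛 is replaced by 𝑐𝑜𝑢𝑛𝑡(𝑌=𝑦) in the normalising factor.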
Example: estimating the probability density 𝐏(𝑋=𝑥)
(Figure: the red dotted lines show the gaussian kernel function placed at each observation.)
Histogram:
  𝐏ˆ(𝑋=𝑥) = 1/[ℎ·𝑛]・𝑐𝑜𝑢𝑛𝑡(𝑋=𝑥)
  where:
    𝑛 = 6
    ℎ = 2 # here we chose 2 but we could choose any number greater than 0
  evaluation:
    𝐏ˆ(𝑋=𝑥) = 1/12・𝑐𝑜𝑢𝑛𝑡(𝑋=𝑥)
    𝐏ˆ(𝑋=-4) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=-3.5) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=-3) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=-2.5) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=-2) = 1/12・2 = 1/6
    𝐏ˆ(𝑋=-1.5) = 1/12・2 = 1/6
    𝐏ˆ(𝑋=-1) = 1/12・2 = 1/6
    𝐏ˆ(𝑋=-0.5) = 1/12・2 = 1/6
    𝐏ˆ(𝑋=0) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=0.5) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=1) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=1.5) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=2) = 1/12・0 = 0
    𝐏ˆ(𝑋=2.5) = 1/12・0 = 0
    𝐏ˆ(𝑋=3) = 1/12・0 = 0
    𝐏ˆ(𝑋=3.5) = 1/12・0 = 0
    𝐏ˆ(𝑋=4) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=4.5) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=5) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=5.5) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=6) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=6.5) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=7) = 1/12・1 = 1/12
    𝐏ˆ(𝑋=7.5) = 1/12・1 = 1/12
KDE:
  𝐏ˆ(𝑋=𝑥) = 1/[ℎ·𝑛]・𝛴_{𝑥_𝑖∊𝑎𝑙𝑙-𝑠𝑎𝑚𝑝𝑙𝑒𝑠}𝑘(𝑥_𝑖,𝑥)
  where:
    𝑛 = 6
    ℎ = 0.5 # here we chose 0.5 but we could choose any number greater than 0
    𝑘(𝑥_𝑖,𝑥) = 𝑒𝑥𝑝(-𝛾·||𝑥_𝑖-𝑥||²) # in this case we use a gaussian kernel, but we could choose any other kernel function
    𝛾 > 0 - width parameter of the gaussian kernel
  evaluation:
    𝐏ˆ(𝑋=𝑥) = 1/3・𝛴_{𝑥_𝑖∊𝑎𝑙𝑙-𝑠𝑎𝑚𝑝𝑙𝑒𝑠}𝑒𝑥𝑝(-𝛾·||𝑥_𝑖-𝑥||²)
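The worked example can be checked numerically. The six observations are not listed in the text, so the values below are hypothetical, chosen only so that the bins [-4,-2), [-2,0), [0,2), [2,4), [4,6), [6,8) contain 1, 2, 1, 0, 1 and 1 samples as in the evaluation. The first bin edge at -4, the half-open bins, and 𝛾 = 1 are likewise assumptions.

```python
import math

# Hypothetical observations, picked only to match the bin counts of the example.
samples = [-3.1, -1.2, -0.4, 0.9, 4.6, 7.1]
n = len(samples)

def hist_estimate(x, h=2.0, origin=-4.0):
    """1/(h*n) times the number of samples in the same bin as x (1/12 * count here)."""
    target = math.floor((x - origin) / h)
    count = sum(1 for xi in samples if math.floor((xi - origin) / h) == target)
    return count / (h * n)

def kde_estimate(x, h=0.5, gamma=1.0):
    """1/(h*n) times the sum of exp(-gamma * (x_i - x)^2) (1/3 * sum here).

    gamma = 1.0 is an assumed value; this kernel is not normalised, so the
    result is a density only up to a constant factor.
    """
    return sum(math.exp(-gamma * (xi - x) ** 2) for xi in samples) / (h * n)

for x in (-4.0, -2.0, 0.0, 2.0, 4.0, 6.0):
    print(x, hist_estimate(x), kde_estimate(x))
```

The histogram values printed at the left edge of each bin reproduce the 1/12, 1/6 and 0 entries of the evaluation, while the kernel estimate varies smoothly between observations instead of jumping at bin edges.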
Choice of Bandwidth h & Bias-Variance Tradeoff
TODO

Resources