Linear/Normal Discriminant/Discriminative Analysis (LDA/NDA)
- LDA is both a classifier and a dimensionality reduction technique
- LDA is a generalization of Fisher’s Linear Discriminant, a method used to find a linear combination of features that separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification
- LDA is closely related to Analysis of Variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label)
- LDA is also closely related to Principal Component Analysis (PCA) and factor analysis in that all three look for linear combinations of variables which best explain the data:
- instead of finding axes of most variation like in PCA, LDA focuses on maximizing the separability among the known categories
- factor analysis builds the feature combinations based on differences rather than similarities
- LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis
LDA - Interpretations
LDA can be interpreted from two perspectives:
Probabilistic Interpretation - useful for understanding the assumptions of LDA
Each class 𝑦∊𝑌 is assigned a prior probability 𝐏(𝑌=𝑦) such that 𝛴𝑦∊𝑌𝐏(𝑌=𝑦) = 1
According to Bayes’ Rule, the posterior probability is:
- 𝐏(𝑌=𝑦’|𝑋=𝑥’) = 𝐏(𝑋=𝑥’|𝑌=𝑦’)𝐏(𝑌=𝑦’) / 𝛴𝑦∊𝑌[𝐏(𝑋=𝑥’|𝑌=𝑦)𝐏(𝑌=𝑦)]
The Maximum a Posteriori (MAP) estimator simplifies to:
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝐏(𝑌=𝑦’|𝑋=𝑥’)
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝐏(𝑋=𝑥’|𝑌=𝑦’)𝐏(𝑌=𝑦’) / 𝛴𝑦∊𝑌[𝐏(𝑋=𝑥’|𝑌=𝑦)𝐏(𝑌=𝑦)]
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝐏(𝑋=𝑥’|𝑌=𝑦’)𝐏(𝑌=𝑦’)
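The last simplification works because the denominator of Bayes' Rule is the same for every class, so it cannot change which class attains the maximum. A minimal sketch with hypothetical toy numbers (the class names, priors and likelihood values are made up for illustration):

```python
# Hypothetical toy numbers: two classes and one observed point x'.
priors = {"a": 0.7, "b": 0.3}        # P(Y=y)
likelihoods = {"a": 0.2, "b": 0.5}   # P(X=x'|Y=y), assumed to be given

# Evidence: the denominator of Bayes' Rule, identical for every class.
evidence = sum(likelihoods[y] * priors[y] for y in priors)

# Posterior P(Y=y|X=x') for each class.
posteriors = {y: likelihoods[y] * priors[y] / evidence for y in priors}

# MAP decision from the full posterior and from the unnormalized numerator.
map_from_posterior = max(posteriors, key=posteriors.get)
map_from_numerator = max(priors, key=lambda y: likelihoods[y] * priors[y])

print(map_from_posterior, map_from_numerator)
```

Both decisions pick the same class, even though only the first one computes the normalizing sum.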
LDA assumes that the class-conditional density is Gaussian:
- 𝐏(𝑋=𝑥’|𝑌=𝑦’) = |2𝜋𝛴𝑦’|^(-1/2)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇𝑦’)^𝑇·𝛴𝑦’^(-1)·(𝑥-𝜇𝑦’) ]
where:
- 𝛴𝑦’ is the covariance matrix of the samples with class 𝑌=𝑦’
- 𝜇𝑦’ is the mean of the samples with class 𝑌=𝑦’
- |·| is the determinant
LDA assumes that all classes 𝑦∊𝑌 have the same covariance matrix:
- 𝛴𝑦= 𝛴, ∀𝑦∊𝑌
Thus:
- 𝐏(𝑋=𝑥’|𝑌=𝑦’) = |2𝜋𝛴|^(-1/2)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇𝑦’)^𝑇·𝛴^(-1)·(𝑥-𝜇𝑦’) ]
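As an illustration, the shared-covariance Gaussian density can be evaluated in log form, which is numerically safer than exponentiating. This is a sketch under stated assumptions: the function name `gaussian_log_density` and the example values are hypothetical, and the covariance matrix is assumed to be positive definite.

```python
import numpy as np

def gaussian_log_density(x, mu, sigma):
    """Log of |2*pi*Sigma|^(-1/2) * exp(-(1/2)(x-mu)^T Sigma^(-1) (x-mu))."""
    diff = x - mu
    # slogdet returns (sign, log|det|); Sigma is assumed positive definite.
    _, logdet = np.linalg.slogdet(2.0 * np.pi * sigma)
    # Solve Sigma z = diff rather than forming the inverse explicitly.
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * logdet - 0.5 * maha

# Hypothetical example: two classes sharing one covariance matrix.
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu_a, mu_b = np.array([0.0, 0.0]), np.array([2.0, 1.0])
x = np.array([0.5, 0.2])

log_p_a = gaussian_log_density(x, mu_a, sigma)
log_p_b = gaussian_log_density(x, mu_b, sigma)
print(log_p_a > log_p_b)  # x lies closer to mu_a in Mahalanobis distance
```

Because both classes share 𝛴, the log-determinant term is identical for the two calls and only the quadratic term differs.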
Now substitute this back into the MAP estimator:
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝐏(𝑋=𝑥’|𝑌=𝑦’)𝐏(𝑌=𝑦’)
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝐏(𝑌=𝑦’)・𝐏(𝑋=𝑥’|𝑌=𝑦’)
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝐏(𝑌=𝑦’)・|2𝜋𝛴|^(-1/2)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇𝑦’)^𝑇·𝛴^(-1)·(𝑥-𝜇𝑦’) ]
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝐏(𝑌=𝑦’)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇𝑦’)^𝑇·𝛴^(-1)·(𝑥-𝜇𝑦’) ] # removed constants
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝑙𝑜𝑔 [ 𝐏(𝑌=𝑦’)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇𝑦’)^𝑇·𝛴^(-1)·(𝑥-𝜇𝑦’) ] ] # log is monotonically increasing
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝑙𝑜𝑔 [ 𝐏(𝑌=𝑦’) ] - (1/2)(𝑥-𝜇𝑦’)^𝑇·𝛴^(-1)·(𝑥-𝜇𝑦’)
- …
- 𝑓𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦’ 𝑙𝑜𝑔 [ 𝐏(𝑌=𝑦’) ] + 𝑥^𝑇𝛴^(-1)𝜇𝑦’ - (1/2)𝜇𝑦’^𝑇𝛴^(-1)𝜇𝑦’
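One way to get from the quadratic form to the final linear expression is to expand the product and use the symmetry of the inverse covariance matrix, so that the two cross terms are equal. This is a worked sketch of the standard expansion, not necessarily the exact steps elided above:

```latex
\begin{aligned}
-\tfrac{1}{2}(x-\mu_{y'})^{T}\Sigma^{-1}(x-\mu_{y'})
  &= -\tfrac{1}{2}x^{T}\Sigma^{-1}x
     + x^{T}\Sigma^{-1}\mu_{y'}
     - \tfrac{1}{2}\mu_{y'}^{T}\Sigma^{-1}\mu_{y'} \\
\arg\max_{y'}\Big[\log P(Y=y') - \tfrac{1}{2}(x-\mu_{y'})^{T}\Sigma^{-1}(x-\mu_{y'})\Big]
  &= \arg\max_{y'}\Big[\log P(Y=y') + x^{T}\Sigma^{-1}\mu_{y'}
     - \tfrac{1}{2}\mu_{y'}^{T}\Sigma^{-1}\mu_{y'}\Big]
\end{aligned}
```

The term -(1/2)𝑥^𝑇𝛴^(-1)𝑥 does not depend on 𝑦’, so it can be dropped from the argmax. What remains is linear in 𝑥, which is why the decision boundaries of LDA are linear.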
To estimate the covariance matrix 𝛴:
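- 𝛴ˆ = 1/(𝑁-𝐾)・𝛴𝑦∊𝑌 𝛴𝑥∊𝑠𝑎𝑚𝑝𝑙𝑒𝑠-𝑤𝑖𝑡ℎ-𝑦 [ (𝑥-𝜇𝑦ˆ)(𝑥-𝜇𝑦ˆ)^𝑇 ]
NOTE: the summed deviations from the class means are divided by 𝑁-𝐾, the degrees of freedom (𝑁 is the total number of samples and 𝐾 the number of classes), to obtain an unbiased estimator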
To estimate the means of the classes 𝜇𝑦’ (aka centroids):
- 𝜇𝑦’ˆ = (1/𝑁𝑦’)・𝛴𝑥∊𝑠𝑎𝑚𝑝𝑙𝑒𝑠-𝑤𝑖𝑡ℎ-𝑦’[𝑥]
where:
- 𝑁𝑦’ - total number of observed samples with 𝑦’
The priors 𝐏(𝑌=𝑦’) are set to the relative frequency of each class among the observed samples:
- 𝐏ˆ(𝑌=𝑦’) = 𝑁𝑦’/𝑁
where:
- 𝑁 - total number of observed samples
With this, we have defined all parameters required for the classifier.
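The estimators and the linear decision rule can be put together in a few lines of NumPy. This is a minimal sketch, assuming a feature matrix `X` with one sample per row and a label vector `y`; the function names `fit_lda` and `predict_lda` and the synthetic data are hypothetical, not part of any library.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, centroids and the pooled covariance from labeled data."""
    classes = np.unique(y)
    n, k = len(y), len(classes)
    priors = np.array([np.mean(y == c) for c in classes])        # N_y / N
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # centroids
    # Pooled within-class scatter, divided by N - K.
    sigma = np.zeros((X.shape[1], X.shape[1]))
    for c, mu in zip(classes, means):
        d = X[y == c] - mu
        sigma += d.T @ d
    sigma /= (n - k)
    return classes, priors, means, sigma

def predict_lda(X, classes, priors, means, sigma):
    """argmax_y  log P(y) + x^T Sigma^-1 mu_y - (1/2) mu_y^T Sigma^-1 mu_y."""
    a = np.linalg.solve(sigma, means.T)  # columns hold Sigma^-1 mu_y
    scores = X @ a + np.log(priors) - 0.5 * np.sum(means.T * a, axis=0)
    return classes[np.argmax(scores, axis=1)]

# Synthetic, well-separated data (an assumption made for the demo).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([4, 4], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_lda(X, y)
y_hat = predict_lda(X, *params)
print("training accuracy:", np.mean(y_hat == y))
```

Solving the linear system with `np.linalg.solve` avoids forming 𝛴^(-1) explicitly, which is a common choice for numerical stability.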
The dimensionality reduction procedure of LDA involves both:
- the within-class variance, 𝑊 = 𝛴ˆ
- the between-class variance 𝐵
The between-class variance indicates the deviation of centroids from the overall mean, 𝜇ˆ = 𝛴𝑦∊𝑌 [ 𝐏ˆ(𝑌=𝑦)・𝜇𝑦ˆ ], and is defined as:
- 𝐵 = 𝛴𝑦∊𝑌 [ 𝐏ˆ(𝑌=𝑦)・(𝜇𝑦ˆ - 𝜇ˆ) (𝜇𝑦ˆ - 𝜇ˆ)^𝑇 ]
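A small sketch of the between-class variance, assuming the priors and centroids have already been estimated. The function name `between_class_variance` and the toy values are hypothetical.

```python
import numpy as np

def between_class_variance(priors, means):
    """Prior-weighted sum of outer products of (centroid - overall mean)."""
    mu = priors @ means          # overall mean: sum_y P(y) * mu_y
    d = means - mu               # one row per class
    return (d * priors[:, None]).T @ d

# Toy example: two equally likely classes whose centroids differ along axis 0.
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 0.0]])
B = between_class_variance(priors, means)
print(B)
```

With 𝐾 classes the centroids span at most 𝐾-1 directions around the overall mean, so 𝐵 has rank at most 𝐾-1. In this toy example the rank is 1.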
Finding the sequence of optimal subspaces involves three steps:
- compute the matrix 𝑀 containing the centroids 𝜇𝑦’ and determine the common covariance matrix 𝑊
- compute 𝑀* = 𝑀𝑊^(-1/2) using the eigen-decomposition of 𝑊
- compute 𝐵* (the between-class covariance of the transformed centroids 𝑀*) and its eigen-decomposition 𝐵* = 𝑉*𝐷𝐵𝑉*^𝑇. The columns 𝑣𝑖* of 𝑉* define the coordinates of the reduced subspace
The 𝑖th discriminant variable is given by 𝑍𝑖 = 𝑣𝑖^𝑇𝑋 with 𝑣𝑖 = 𝑊^(-1/2)𝑣𝑖*
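The three steps can be sketched in NumPy as follows. The function name `lda_directions` and the example values are hypothetical, and weighting 𝐵* by the class priors is an assumption made to match the definition of 𝐵 used in these notes.

```python
import numpy as np

def lda_directions(means, W, priors):
    """Return discriminant directions v_i (as columns) and their eigenvalues."""
    # W^(-1/2) from the eigen-decomposition W = U diag(s) U^T.
    s, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(s ** -0.5) @ U.T
    M_star = means @ W_inv_sqrt                # sphered centroids
    # Between-class covariance of the sphered centroids (prior-weighted).
    D = M_star - priors @ M_star
    B_star = (D * priors[:, None]).T @ D
    d_B, V_star = np.linalg.eigh(B_star)
    order = np.argsort(d_B)[::-1]              # largest eigenvalue first
    V = W_inv_sqrt @ V_star[:, order]          # v_i = W^(-1/2) v_i*
    return V, d_B[order]

# Hypothetical inputs: three centroids in two dimensions.
W = np.array([[2.0, 0.5], [0.5, 1.0]])
means = np.array([[0.0, 0.0], [2.0, 1.0], [1.0, 3.0]])
priors = np.array([0.3, 0.3, 0.4])

V, eigvals = lda_directions(means, W, priors)
Z = means @ V   # discriminant coordinates of the centroids, Z_i = v_i^T x
print(eigvals)
```

Projecting data onto the first few columns of `V` gives the reduced representation. The directions satisfy 𝑉^𝑇𝑊𝑉 = 𝐼, meaning the within-class covariance is the identity in the new coordinates.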
Linear-Algebra Interpretation (due to Fisher) - useful for understanding how LDA performs dimensionality reduction
TODO
LDA - Subpages
- Logistic Regression (LR) vs Linear Discriminant Analysis (LDA)
- Linear Discriminant Analysis (LDA) vs Quadratic Discriminant Analysis (QDA)