Linear/Normal Discriminant/Discriminative Analysis (LDA/NDA)

LDA is both a classifier and a dimensionality reduction technique.
LDA is a generalization of Fisher’s Linear Discriminant, a method used to find a linear combination of features that separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification.
LDA is closely related to Analysis of Variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label).
LDA is also closely related to Principal Component Analysis (PCA) and factor analysis in that all three look for linear combinations of variables which best explain the data:
- instead of finding the axes of greatest variation as PCA does, LDA focuses on maximizing the separability among the known categories
- factor analysis builds the feature combinations based on differences rather than similarities
LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.

LDA - Interpretations

LDA can be interpreted from two perspectives:

Probabilistic Interpretation - useful for understanding the assumptions of LDA

Each class 𝑦∊𝑌 is assigned a prior probability 𝐏(𝑌=𝑦) such that 𝛴_𝑦∊𝑌 𝐏(𝑌=𝑦) = 1

According to Bayes’ Rule, the posterior probability is:

𝐏(𝑌=𝑦’|𝑋=𝑥’) = 𝐏(𝑋=𝑥’|𝑌=𝑦’)𝐏(𝑌=𝑦’) / 𝛴_𝑦∊𝑌[𝐏(𝑋=𝑥’|𝑌=𝑦)𝐏(𝑌=𝑦)]

The Maximum a Posteriori (MAP) estimator simplifies to:

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝐏(𝑌=𝑦’|𝑋=𝑥’)

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝐏(𝑋=𝑥’|𝑌=𝑦’)𝐏(𝑌=𝑦’) / 𝛴_𝑦∊𝑌[𝐏(𝑋=𝑥’|𝑌=𝑦)𝐏(𝑌=𝑦)]

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝐏(𝑋=𝑥’|𝑌=𝑦’)𝐏(𝑌=𝑦’) # the denominator does not depend on 𝑦’, so it can be dropped
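
As a small numerical illustration of why the denominator can be dropped, the sketch below uses made-up priors and likelihood values at a single point (the class names and numbers are purely hypothetical) and checks that the normalized posterior and the unnormalized product select the same class.

```python
# Hypothetical priors P(Y=y) and likelihoods P(X=x'|Y=y) at one fixed point x'.
priors = {"a": 0.5, "b": 0.3, "c": 0.2}
likelihoods = {"a": 0.10, "b": 0.40, "c": 0.25}

# Unnormalized scores: P(X=x'|Y=y) * P(Y=y)
scores = {y: likelihoods[y] * priors[y] for y in priors}

# The evidence (the denominator in Bayes' rule) is the same for every class.
evidence = sum(scores.values())
posteriors = {y: s / evidence for y, s in scores.items()}

# Both argmax computations pick the same class.
map_from_posterior = max(posteriors, key=posteriors.get)
map_from_scores = max(scores, key=scores.get)
```

Dividing every score by the same positive number rescales them without changing their order, which is all the argmax depends on.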

LDA assumes that the density is Gaussian:

𝐏(𝑋=𝑥’|𝑌=𝑦’) = |2𝜋𝛴_𝑦’|^-(1/2)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇_𝑦’)^𝑇·𝛴_𝑦’^-1·(𝑥-𝜇_𝑦’) ]

where:

𝛴_𝑦’ is the covariance matrix of the samples with class 𝑌=𝑦’

𝜇_𝑦’ is the mean of the samples with class 𝑌=𝑦’

|·| is the determinant

LDA assumes that all classes 𝑦∊𝑌 have the same covariance matrix:

𝛴_𝑦 = 𝛴, ∀𝑦∊𝑌

Thus:

𝐏(𝑋=𝑥’|𝑌=𝑦’) = |2𝜋𝛴|^-(1/2)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇_𝑦’)^𝑇·𝛴^-1·(𝑥-𝜇_𝑦’) ]
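
The normalizing constant |2𝜋𝛴|^-(1/2) is the same quantity as the more common textbook form (2𝜋)^(-d/2)·|𝛴|^-(1/2), because scaling a d×d matrix by a constant c scales its determinant by c^d. A minimal NumPy sketch of the density, assuming a hypothetical helper name `gaussian_density` and arbitrary example values:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate normal density written as |2*pi*Sigma|^(-1/2) * exp(...)."""
    diff = x - mu
    norm_const = np.linalg.det(2.0 * np.pi * sigma) ** -0.5
    # Solve Sigma * z = diff instead of forming the inverse explicitly.
    quad = diff @ np.linalg.solve(sigma, diff)
    return norm_const * np.exp(-0.5 * quad)

# Hypothetical 2-D example values.
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([0.5, 0.0])

d = len(mu)
# The more common textbook form of the same constant.
textbook_const = (2.0 * np.pi) ** (-d / 2) * np.linalg.det(sigma) ** -0.5
density = gaussian_density(x, mu, sigma)
```

Because the constant depends only on 𝛴, it is identical for every class once the covariance is shared.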

Now substitute back into the MAP estimator:

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝐏(𝑋=𝑥’|𝑌=𝑦’)𝐏(𝑌=𝑦’)

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝐏(𝑌=𝑦’)・𝐏(𝑋=𝑥’|𝑌=𝑦’)

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝐏(𝑌=𝑦’)・|2𝜋𝛴|^-(1/2)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇_𝑦’)^𝑇·𝛴^-1·(𝑥-𝜇_𝑦’) ]

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝐏(𝑌=𝑦’)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇_𝑦’)^𝑇·𝛴^-1·(𝑥-𝜇_𝑦’) ] # dropped the constant |2𝜋𝛴|^-(1/2), which does not depend on 𝑦’

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝑙𝑜𝑔 [ 𝐏(𝑌=𝑦’)・𝑒𝑥𝑝[ -(1/2)(𝑥-𝜇_𝑦’)^𝑇·𝛴^-1·(𝑥-𝜇_𝑦’) ] ] # log is monotonic, so the argmax is unchanged

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝑙𝑜𝑔 [ 𝐏(𝑌=𝑦’) ] - (1/2)(𝑥-𝜇_𝑦’)^𝑇·𝛴^-1·(𝑥-𝜇_𝑦’)

…

𝑓_𝑦ˆ(𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥_𝑦’ 𝑙𝑜𝑔 [ 𝐏(𝑌=𝑦’) ] + 𝑥^𝑇𝛴^-1𝜇_𝑦’ - (1/2)𝜇_𝑦’^𝑇𝛴^-1𝜇_𝑦’
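
The step from the quadratic form to this final expression can be reconstructed with the standard expansion below. This is a sketch of the usual argument, not necessarily the exact steps elided above. It uses only the symmetry of 𝛴^-1, which makes the two cross terms equal.

```latex
\begin{aligned}
-\tfrac{1}{2}(x-\mu_{y'})^{T}\Sigma^{-1}(x-\mu_{y'})
  &= -\tfrac{1}{2}x^{T}\Sigma^{-1}x
     + \tfrac{1}{2}x^{T}\Sigma^{-1}\mu_{y'}
     + \tfrac{1}{2}\mu_{y'}^{T}\Sigma^{-1}x
     - \tfrac{1}{2}\mu_{y'}^{T}\Sigma^{-1}\mu_{y'} \\
  &= -\tfrac{1}{2}x^{T}\Sigma^{-1}x
     + x^{T}\Sigma^{-1}\mu_{y'}
     - \tfrac{1}{2}\mu_{y'}^{T}\Sigma^{-1}\mu_{y'}
\end{aligned}
```

The term -(1/2)𝑥^𝑇𝛴^-1𝑥 is the same for every class, so it drops out of the argmax. What remains is linear in 𝑥, which is why the decision boundaries are linear.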

To estimate the covariance matrix 𝛴:

𝛴ˆ = 1/(𝑁-𝐾)・𝛴_𝑦’∊𝑌 𝛴_{𝑥∊𝑠𝑎𝑚𝑝𝑙𝑒𝑠-𝑤𝑖𝑡ℎ-𝑦’}[ (𝑥-𝜇_𝑦’ˆ)(𝑥-𝜇_𝑦’ˆ)^𝑇 ]

NOTE: the summed deviations from the class means are divided by 𝑁-𝐾 (𝑁 samples, 𝐾 classes), the degrees of freedom, to obtain an unbiased estimator
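
A minimal NumPy sketch of this pooled estimate, assuming the usual form in which each sample's deviation from its own class centroid contributes an outer product and the total is divided by 𝑁-𝐾. The function name `pooled_covariance` and the toy data are hypothetical.

```python
import numpy as np

def pooled_covariance(X, y):
    """Pooled within-class covariance with the N - K denominator."""
    classes = np.unique(y)
    n_samples, n_features = X.shape
    scatter = np.zeros((n_features, n_features))
    for c in classes:
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)   # deviations from the class centroid
        scatter += centered.T @ centered  # sum of outer products for this class
    return scatter / (n_samples - len(classes))

# Hypothetical toy data: 3 classes, 2 features, 20 samples per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(20, 2)) for m in ([0, 0], [3, 1], [1, 4])])
y = np.repeat([0, 1, 2], 20)

sigma_hat = pooled_covariance(X, y)
```

The result equals the average of the per-class sample covariances weighted by 𝑁_𝑦’-1, which is one way to sanity-check an implementation.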

To estimate the means of the classes 𝜇_𝑦’ (aka centroids):

𝜇_𝑦’ˆ = 1/𝑁_𝑦’・𝛴_{𝑥∊𝑠𝑎𝑚𝑝𝑙𝑒𝑠-𝑤𝑖𝑡ℎ-𝑦’}[𝑥]

where:

𝑁_𝑦’ - total number of observed samples with class 𝑦’

The priors 𝐏(𝑌=𝑦’) are estimated as the proportion of observed samples that belong to each class:

𝐏ˆ(𝑌=𝑦’) = 𝑁_𝑦’/𝑁

where:

𝑁 - total number of observed samples

With this, we have defined all parameters required for the classifier.
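
Putting the estimates together, the classifier can be sketched in a few lines of NumPy. This is an illustrative sketch under the assumptions stated above (Gaussian classes, shared covariance), not a reference implementation. The names `fit_lda` and `predict_lda` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, centroids and the pooled covariance from labeled data."""
    classes = np.unique(y)
    n, k = len(y), len(classes)
    priors = np.array([np.mean(y == c) for c in classes])        # N_y / N
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # centroids
    centered = X - means[np.searchsorted(classes, y)]            # x - centroid of x's class
    sigma = centered.T @ centered / (n - k)                      # pooled covariance
    return classes, priors, means, sigma

def predict_lda(X, classes, priors, means, sigma):
    """Pick the class maximizing log P(y) + x^T S^-1 mu_y - 0.5 mu_y^T S^-1 mu_y."""
    sinv_mu = np.linalg.solve(sigma, means.T)                    # columns are S^-1 mu_y
    scores = np.log(priors) + X @ sinv_mu - 0.5 * np.sum(means.T * sinv_mu, axis=0)
    return classes[np.argmax(scores, axis=1)]

# Hypothetical, well-separated two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], size=(50, 2)),
               rng.normal(loc=[6, 6], size=(50, 2))])
y = np.repeat(["a", "b"], 50)

params = fit_lda(X, y)
pred = predict_lda(X, *params)
```

`np.linalg.solve` is used rather than an explicit inverse; for a nearly singular 𝛴ˆ some form of regularization would probably be needed, which this sketch leaves out.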

The dimensionality reduction procedure of LDA involves both:

- the within-class variance, 𝑊 = 𝛴ˆ
- the between-class variance, 𝐵

The between-class variance indicates the deviation of the centroids from the overall mean, 𝜇ˆ = 𝛴_𝑦’∊𝑌 [ 𝐏ˆ(𝑌=𝑦’)・𝜇_𝑦’ˆ ], and is defined as:

𝐵 = 𝛴_𝑦’∊𝑌 [ 𝐏ˆ(𝑌=𝑦’)・(𝜇_𝑦’ˆ - 𝜇ˆ)(𝜇_𝑦’ˆ - 𝜇ˆ)^𝑇 ]
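
The definition of 𝐵 translates fairly directly into NumPy. The sketch below assumes the priors are the class proportions, as estimated earlier. The function name `between_class_matrix` and the synthetic data are hypothetical.

```python
import numpy as np

def between_class_matrix(X, y):
    """Prior-weighted scatter of the class centroids around the overall mean."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    overall_mean = priors @ means            # prior-weighted mean of the centroids
    dev = means - overall_mean               # one row per class: mu_y - mu
    B = (dev * priors[:, None]).T @ dev      # sum_y P(y) (mu_y - mu)(mu_y - mu)^T
    return B, overall_mean

# Hypothetical toy data: 3 classes, 4 features, 20 samples per class.
rng = np.random.default_rng(1)
offsets = np.array([[0, 0, 0, 0], [2, 0, 1, 0], [0, 3, 0, 1]])
X = rng.normal(size=(60, 4)) + np.repeat(offsets, 20, axis=0)
y = np.repeat([0, 1, 2], 20)

B, overall_mean = between_class_matrix(X, y)
```

With priors equal to the class proportions, the weighted mean of the centroids coincides with the plain mean of all samples. Since the weighted deviations sum to zero, 𝐵 has rank at most 𝐾-1, which is why LDA yields at most 𝐾-1 discriminant directions.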

Finding a sequence of optimal subspaces involves 3 steps:

1. compute the matrix 𝑀 containing the centroids 𝜇_𝑦’ (one centroid per row) and determine the common covariance matrix 𝑊

2. compute 𝑀* = 𝑀𝑊^-(1/2) using the eigen-decomposition of 𝑊

3. compute 𝐵*, the covariance matrix of 𝑀* (the between-class covariance), and its eigen-decomposition 𝐵* = 𝑉*𝐷_𝐵𝑉*^𝑇. The columns 𝑣_𝑖* of 𝑉*, ordered by decreasing eigenvalue, define the coordinates of the reduced subspace

The 𝑖^th discriminant variable is determined by 𝑍_𝑖 = 𝑣_𝑖^𝑇𝑋 with 𝑣_𝑖 = 𝑊^-(1/2)𝑣_𝑖*
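
The three steps and the final back-transformation can be sketched with NumPy as follows. This is a hedged illustration: it assumes 𝐵* is the prior-weighted covariance of the sphered centroids (matching the definition of 𝐵 above; other references weight the centroids differently), and the function name `lda_directions` and the synthetic data are hypothetical.

```python
import numpy as np

def lda_directions(X, y):
    """Return the discriminant directions v_i as columns (at most K - 1 of them)."""
    classes = np.unique(y)
    n, k = len(y), len(classes)
    priors = np.array([np.mean(y == c) for c in classes])

    # Step 1: centroid matrix M (one row per class) and common covariance W.
    M = np.array([X[y == c].mean(axis=0) for c in classes])
    centered = X - M[np.searchsorted(classes, y)]
    W = centered.T @ centered / (n - k)

    # Step 2: M* = M W^(-1/2), with W^(-1/2) built from the eigen-decomposition of W.
    w_vals, w_vecs = np.linalg.eigh(W)
    W_inv_sqrt = w_vecs @ np.diag(w_vals ** -0.5) @ w_vecs.T
    M_star = M @ W_inv_sqrt

    # Step 3: B* from the sphered centroids, then its eigen-decomposition.
    dev = M_star - priors @ M_star
    B_star = (dev * priors[:, None]).T @ dev
    b_vals, V_star = np.linalg.eigh(B_star)
    order = np.argsort(b_vals)[::-1][: k - 1]   # largest eigenvalues first
    V_star = V_star[:, order]

    # Back-transform: v_i = W^(-1/2) v_i*
    return W_inv_sqrt @ V_star

# Hypothetical toy data: 3 classes, 4 features, 30 samples per class.
rng = np.random.default_rng(2)
offsets = np.array([[0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1]])
X = rng.normal(size=(90, 4)) + np.repeat(offsets, 30, axis=0)
y = np.repeat([0, 1, 2], 30)

V = lda_directions(X, y)
Z = X @ V   # discriminant variables Z_i = v_i^T x, one column per direction
```

One property worth checking in such a sketch is that the within-class covariance of the projected data `Z` is the identity, since the directions are orthonormal in the sphered space.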

Linear-Algebra Interpretation (due to Fisher) - useful for understanding how LDA performs dimensionality reduction

TODO: https://www.datascienceblog.net/post/machine-learning/linear-discriminant-analysis/

TODO

／var／log marcus chiu
