Algebra of LDA. Fisher discrimination power of a variable and Linear Discriminant Analysis

Here is a short tale about Linear Discriminant Analysis (LDA) as a reply to the question.

When we have one variable and $k$ groups (classes) to discriminate by it, this is ANOVA. The discrimination power of the variable is $SS_\text{between groups} / SS_\text{within groups}$, or $B/W$.
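To make the $B/W$ ratio concrete, here is a minimal sketch in plain Python. The three groups are made-up toy numbers, not data from the question.

```python
# Hypothetical toy data: one variable measured in k = 3 groups.
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]

def mean(values):
    return sum(values) / len(values)

all_values = [v for g in groups for v in g]
grand_mean = mean(all_values)

# Within-groups SS: squared deviations from each group's own mean.
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
# Between-groups SS: group size times squared deviation of the group mean
# from the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

b_over_w = ss_between / ss_within
print(b_over_w)  # 9.0
```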

When we have $p$ variables, this is MANOVA. If the variables are uncorrelated both in the total sample and within groups, then the above discrimination power, $B/W$, is computed analogously and can be written as $trace(\bf{S_b})/trace(\bf{S_w})$, where $\bf{S_w}$ is the pooled within-group scatter matrix (i.e. the sum of the $k$ $p \times p$ SSCP matrices of the variables, each centered about the respective group's centroid); $\bf{S_b}$ is the between-group scatter matrix $=\bf{S_t}-\bf{S_w}$, where $\bf{S_t}$ is the scatter matrix for the whole data (the SSCP matrix of the variables centered about the grand centroid). (A "scatter matrix" is just a covariance matrix without the division by sample_size-1.)
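The three scatter matrices can be sketched as follows, assuming NumPy and a synthetic 3-group, 3-variable dataset (the seed, group means and sizes are arbitrary choices for illustration). The sketch also cross-checks $\bf{S_b}=\bf{S_t}-\bf{S_w}$ against a direct computation from the group centroids.

```python
import numpy as np

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)                     # k = 3 groups, 20 cases each
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
X = means[y] + rng.normal(size=(60, 3))          # p = 3 variables

grand = X.mean(axis=0)
S_t = (X - grand).T @ (X - grand)                # SSCP about the grand centroid

S_w = np.zeros((3, 3))
for g in np.unique(y):
    D = X[y == g] - X[y == g].mean(axis=0)       # center on the group centroid
    S_w += D.T @ D                               # add this group's SSCP matrix

S_b = S_t - S_w                                  # between-group scatter

# Cross-check: S_b built directly from the group centroids.
S_b_direct = np.zeros((3, 3))
for g in np.unique(y):
    d = (X[y == g].mean(axis=0) - grand)[:, None]
    S_b_direct += (y == g).sum() * (d @ d.T)

# B/W in the special case of uncorrelated variables.
trace_ratio = np.trace(S_b) / np.trace(S_w)
```

Dividing `S_t` by `len(X) - 1` gives the ordinary covariance matrix, which is the sense in which a scatter matrix is an "undivided" covariance matrix.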

When there is some correlation between the variables, and usually there is, the above $B/W$ is expressed by $\bf{S_w^{-1} S_b}$, which is no longer a scalar but a matrix. This is simply because there are $p$ discriminative variables hidden behind this "overall" discrimination and partly sharing it.

Now, we may want to dive into MANOVA and decompose $\bf{S_w^{-1} S_b}$ into new, mutually orthogonal latent variables (their number is $min(p,k-1)$) called discriminant functions or discriminants: the first is the strongest discriminator, the second is the next strongest, and so on. This is just like what we do in Principal component analysis. We replace the original correlated variables with uncorrelated discriminants without loss of discriminative power. Because each successive discriminant is weaker and weaker, we may accept a small subset of the first $m$ discriminants without great loss of discriminative power (again, similar to how we use PCA). This is the essence of LDA as a dimensionality reduction technique (LDA is also a Bayes classification technique, but that is an entirely separate topic).
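A rough illustration of the count $min(p,k-1)$ and of the decreasing strength of successive discriminants, assuming NumPy and synthetic data with $p=3$ and $k=3$. Here the eigenvalues are taken straight from the non-symmetric matrix with a generic solver, just to look at them.

```python
import numpy as np

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)                     # k = 3 groups
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = means[y] + rng.normal(size=(60, 3)) @ mix    # p = 3 correlated variables

Xc = X - X.mean(axis=0)
S_t = Xc.T @ Xc
S_w = np.zeros_like(S_t)
for g in np.unique(y):
    D = X[y == g] - X[y == g].mean(axis=0)
    S_w += D.T @ D
S_b = S_t - S_w

M = np.linalg.solve(S_w, S_b)                    # S_w^{-1} S_b without an explicit inverse
eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]

# Eigenvalues that are numerically nonzero correspond to discriminants.
n_discriminants = int(np.sum(eigvals > 1e-8 * eigvals[0]))

# Share of the total discriminative power carried by each discriminant.
share = eigvals[:n_discriminants] / eigvals[:n_discriminants].sum()
```

With these settings `n_discriminants` comes out as 2, and `share` shows how much would be lost by keeping only the first discriminant.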

LDA thus resembles PCA. PCA decomposes "correlatedness", LDA decomposes "separatedness". In LDA, because the above matrix expressing "separatedness" isn't symmetric, an algebraic workaround is used to find its eigenvalues and eigenvectors$^1$. The eigenvalue of each discriminant function (a latent variable) is its discriminative power $B/W$ that I spoke of in the first paragraph. It is also worth mentioning that the discriminants, although uncorrelated, are not geometrically orthogonal as axes drawn in the original variable space.

Some potentially related topics that you might want to read:

LDA is MANOVA "deepened" into the analysis of latent structure and is a particular case of Canonical correlation analysis (exact equivalence between them as such). How LDA classifies objects and what Fisher's coefficients are. (For now I point only to my own answers, as I remember them, but there are also many good and better answers from other people on this site.)

$^1$ The computations of the LDA extraction phase are as follows. The eigenvalues ($\bf L$) of $\bf{S_w^{-1} S_b}$ are the same as those of the symmetric matrix $\bf{(U^{-1})' S_b U^{-1}}$, where $\bf U$ is the Cholesky root of $\bf{S_w}$: an upper-triangular matrix such that $\bf{U'U=S_w}$. As for the eigenvectors of $\bf{S_w^{-1} S_b}$, they are given by $\bf{V=U^{-1} E}$, where $\bf E$ are the eigenvectors of the above matrix $\bf{(U^{-1})' S_b U^{-1}}$.
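The Cholesky workaround can be sketched like this, assuming NumPy and synthetic data. Note that `np.linalg.cholesky` returns the lower-triangular factor, so it is transposed here to get the upper-triangular $\bf U$ with $\bf{U'U=S_w}$.

```python
import numpy as np

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = means[y] + rng.normal(size=(60, 3)) @ mix

Xc = X - X.mean(axis=0)
S_t = Xc.T @ Xc
S_w = np.zeros_like(S_t)
for g in np.unique(y):
    D = X[y == g] - X[y == g].mean(axis=0)
    S_w += D.T @ D
S_b = S_t - S_w

U = np.linalg.cholesky(S_w).T        # upper-triangular Cholesky root, U'U = S_w
U_inv = np.linalg.inv(U)
M = U_inv.T @ S_b @ U_inv            # symmetric; same eigenvalues as S_w^{-1} S_b

L, E = np.linalg.eigh(M)             # eigh returns eigenvalues in ascending order
order = np.argsort(L)[::-1]          # strongest discriminant first
L, E = L[order], E[:, order]
V = U_inv @ E                        # eigenvectors of S_w^{-1} S_b

# Reference values from a generic non-symmetric eigen-solver.
L_direct = np.sort(np.linalg.eigvals(np.linalg.solve(S_w, S_b)).real)[::-1]
```

A side effect of this route is that the eigenvectors come out scaled so that $\bf{V' S_w V = I}$.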

Alternatively, the decomposition of $\bf{S_w^{-1} S_b}$ can be done by a "quasi zca-whitening" method: from $\bf{S_w}$ obtain its symmetric inverse square root $\bf S_w^{-1/2}$; then the eigendecomposition of $\bf S_w^{-1/2} S_b S_w^{-1/2}$ (which is a symmetric matrix) yields the discriminant eigenvalues $\bf L$ and eigenvectors $\bf A$, whereby the discriminant eigenvectors are $\bf V= S_w^{-1/2} A$. The "quasi zca-whitening" method can be rewritten to work via singular value decomposition of the casewise dataset instead of the $\bf S_w$ and $\bf S_b$ scatter matrices; that adds computational precision (which is important in near-singular situations) but sacrifices speed.
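A sketch of the whitening route next to the Cholesky route, assuming NumPy and synthetic data; the symmetric inverse square root is built here from an eigendecomposition of $\bf S_w$, which is one common way to get it. The SVD-of-the-data variant is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = means[y] + rng.normal(size=(60, 3)) @ mix

Xc = X - X.mean(axis=0)
S_t = Xc.T @ Xc
S_w = np.zeros_like(S_t)
for g in np.unique(y):
    D = X[y == g] - X[y == g].mean(axis=0)
    S_w += D.T @ D
S_b = S_t - S_w

def sorted_eigh(M):
    """Symmetric eigendecomposition with eigenvalues in descending order."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# Whitening route: symmetric S_w^{-1/2}, then a symmetric eigenproblem.
d, Q = np.linalg.eigh(S_w)
W = Q @ np.diag(d ** -0.5) @ Q.T                 # S_w^{-1/2}
L_zca, A = sorted_eigh(W @ S_b @ W)
V_zca = W @ A

# Cholesky route, for comparison.
U_inv = np.linalg.inv(np.linalg.cholesky(S_w).T)
L_chol, E = sorted_eigh(U_inv.T @ S_b @ U_inv)
V_chol = U_inv @ E
```

The two routes should agree on the eigenvalues and, up to the sign of each column, on the eigenvectors of the nonzero discriminants.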

OK, let's turn to the statistics usually computed in LDA. Canonical correlations corresponding to the eigenvalues are $\bf \Gamma = \sqrt{L/(L+1)}$ . Whereas the eigenvalue of a discriminant is the $B/W$ of the ANOVA of that discriminant, the squared canonical correlation is the $B/T$ (T = total sum of squares) of that ANOVA.
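The $B/W$ and $B/T$ readings can be checked numerically by running a one-way ANOVA on each discriminant's scores. This is a sketch assuming NumPy and synthetic data; `lda_extract` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def lda_extract(X, y):
    """Hypothetical helper: eigenvalues L, eigenvectors V, and S_w, S_b."""
    Xc = X - X.mean(axis=0)
    S_t = Xc.T @ Xc
    S_w = np.zeros_like(S_t)
    for g in np.unique(y):
        D = X[y == g] - X[y == g].mean(axis=0)
        S_w += D.T @ D
    S_b = S_t - S_w
    U_inv = np.linalg.inv(np.linalg.cholesky(S_w).T)   # U'U = S_w
    L, E = np.linalg.eigh(U_inv.T @ S_b @ U_inv)
    order = np.argsort(L)[::-1]
    return L[order], U_inv @ E[:, order], S_w, S_b

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = means[y] + rng.normal(size=(60, 3)) @ mix

L, V, S_w, S_b = lda_extract(X, y)
m = min(X.shape[1], len(np.unique(y)) - 1)       # number of discriminants
L, V = L[:m], V[:, :m]

gamma = np.sqrt(L / (L + 1.0))                   # canonical correlations

# One-way ANOVA sums of squares for each discriminant's scores.
scores = (X - X.mean(axis=0)) @ V                # centered, so the mean is 0
ss_total = (scores ** 2).sum(axis=0)
ss_within = np.zeros(m)
for g in np.unique(y):
    s = scores[y == g]
    ss_within += ((s - s.mean(axis=0)) ** 2).sum(axis=0)
ss_between = ss_total - ss_within
```

Here `ss_between / ss_within` should reproduce `L`, and `ss_between / ss_total` should reproduce `gamma ** 2`.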

If you normalize (to SS=1) columns of eigenvectors $\bf V$ then these values can be seen as the direction cosines of the rotation of axes-variables into axes-discriminants; so with their help one can plot discriminants as axes on the scatterplot defined by the original variables (the eigenvectors, as axes in that variables' space, are not orthogonal).
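A small check of the non-orthogonality, assuming NumPy and synthetic data with correlated variables; `lda_extract` is a hypothetical helper for this example. The columns of $\bf V$ are rescaled to unit length, and the off-diagonal entries of their cross-product matrix are the cosines of the angles between the discriminant axes.

```python
import numpy as np

def lda_extract(X, y):
    """Hypothetical helper: eigenvalues L, eigenvectors V, and S_w, S_b."""
    Xc = X - X.mean(axis=0)
    S_t = Xc.T @ Xc
    S_w = np.zeros_like(S_t)
    for g in np.unique(y):
        D = X[y == g] - X[y == g].mean(axis=0)
        S_w += D.T @ D
    S_b = S_t - S_w
    U_inv = np.linalg.inv(np.linalg.cholesky(S_w).T)   # U'U = S_w
    L, E = np.linalg.eigh(U_inv.T @ S_b @ U_inv)
    order = np.argsort(L)[::-1]
    return L[order], U_inv @ E[:, order], S_w, S_b

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = means[y] + rng.normal(size=(60, 3)) @ mix

L, V, S_w, S_b = lda_extract(X, y)
V = V[:, :2]                                     # the min(p, k-1) = 2 discriminants

V_unit = V / np.sqrt((V ** 2).sum(axis=0))       # each column scaled to SS = 1
cosines = V_unit.T @ V_unit                      # off-diagonal: cosine between the axes
angle_deg = np.degrees(np.arccos(cosines[0, 1]))
```

For data like these, `angle_deg` differs from 90 degrees, even though the discriminant scores themselves are uncorrelated.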

The unstandardized discriminant coefficients or weights are simply the scaled eigenvectors $\bf {C}= \it \sqrt{N-k} ~\bf V$ . These are the coefficients of linear prediction of discriminants by the centered original variables. The values of discriminant functions themselves (discriminant scores) are $\bf XC$ , where $\bf X$ is the centered original variables (input multivariate data with each column centered). Discriminants are uncorrelated, and when computed by the formula just above they also have the property that their pooled within-class covariance matrix is the identity matrix.
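Both properties of the scores can be verified in a few lines. This is a sketch assuming NumPy, synthetic data, and eigenvectors scaled so that $\bf{V' S_w V = I}$ (which is what the Cholesky route produces); `lda_extract` is a hypothetical helper for this example.

```python
import numpy as np

def lda_extract(X, y):
    """Hypothetical helper: eigenvalues L, eigenvectors V, and S_w, S_b."""
    Xc = X - X.mean(axis=0)
    S_t = Xc.T @ Xc
    S_w = np.zeros_like(S_t)
    for g in np.unique(y):
        D = X[y == g] - X[y == g].mean(axis=0)
        S_w += D.T @ D
    S_b = S_t - S_w
    U_inv = np.linalg.inv(np.linalg.cholesky(S_w).T)   # U'U = S_w
    L, E = np.linalg.eigh(U_inv.T @ S_b @ U_inv)
    order = np.argsort(L)[::-1]
    return L[order], U_inv @ E[:, order], S_w, S_b

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = means[y] + rng.normal(size=(60, 3)) @ mix

L, V, S_w, S_b = lda_extract(X, y)
N, k, m = len(X), 3, 2
C = np.sqrt(N - k) * V[:, :m]                    # unstandardized coefficients
scores = (X - X.mean(axis=0)) @ C                # discriminant scores

# Pooled within-class covariance matrix of the scores.
pooled_cov = np.zeros((m, m))
for g in np.unique(y):
    D = scores[y == g] - scores[y == g].mean(axis=0)
    pooled_cov += D.T @ D
pooled_cov /= N - k

# Correlations between discriminants in the total sample.
total_corr = np.corrcoef(scores, rowvar=False)
```

`pooled_cov` should be the identity matrix and `total_corr` should have zero off-diagonal entries.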

Optional constant terms, which accompany the unstandardized coefficients and allow one to un-center the discriminants if the input variables had nonzero means, are $\bf {C_0} \it = -\sum^p diag(\bar{X}) \bf C$ , where $diag(\bar{X})$ is the diagonal matrix of the p variables' means and $\sum^p$ is the sum across the variables.
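In code, the constant terms let the same scores be computed from the raw (uncentered) variables. A sketch assuming NumPy and synthetic data; `lda_coefficients` is a hypothetical helper for this example.

```python
import numpy as np

def lda_coefficients(X, y, m):
    """Hypothetical helper: unstandardized coefficients C for m discriminants."""
    Xc = X - X.mean(axis=0)
    S_t = Xc.T @ Xc
    S_w = np.zeros_like(S_t)
    for g in np.unique(y):
        D = X[y == g] - X[y == g].mean(axis=0)
        S_w += D.T @ D
    U_inv = np.linalg.inv(np.linalg.cholesky(S_w).T)   # U'U = S_w
    L, E = np.linalg.eigh(U_inv.T @ (S_t - S_w) @ U_inv)
    V = U_inv @ E[:, np.argsort(L)[::-1]]
    return np.sqrt(len(X) - len(np.unique(y))) * V[:, :m]

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)
means = np.array([[5.0, 5.0, 5.0], [7.0, 6.0, 5.0], [5.0, 8.0, 6.0]])  # nonzero means
X = means[y] + rng.normal(size=(60, 3))

C = lda_coefficients(X, y, m=2)
x_bar = X.mean(axis=0)

# Constant terms: scale each row of C by that variable's mean, sum over
# variables, and flip the sign (equivalent to -x_bar @ C).
C0 = -(np.diag(x_bar) @ C).sum(axis=0)

scores_from_centered = (X - x_bar) @ C
scores_from_raw = X @ C + C0
```

Both score matrices should coincide, so `C` together with `C0` can be applied to new raw observations directly.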

In standardized discriminant coefficients, the contribution of variables into a discriminant is adjusted to the fact that variables have different variances and might be measured in different units; $\bf {K} \it = \sqrt{diag \bf (S_w)} \bf V$ (where $diag \bf (S_w)$ is the diagonal matrix with the diagonal of $\bf S_w$ ). Despite being "standardized", these coefficients may occasionally exceed 1 (so don't be confused). If the input variables were z-standardized within each class separately, standardized coefficients = unstandardized ones. Coefficients may be used to interpret discriminants.
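One way to read the formula for $\bf K$: it is the unstandardized coefficients with each variable's row multiplied by that variable's pooled within-class standard deviation. A sketch assuming NumPy and synthetic data with deliberately unequal variable scales; `lda_extract` is a hypothetical helper for this example.

```python
import numpy as np

def lda_extract(X, y):
    """Hypothetical helper: eigenvalues L, eigenvectors V, and S_w, S_b."""
    Xc = X - X.mean(axis=0)
    S_t = Xc.T @ Xc
    S_w = np.zeros_like(S_t)
    for g in np.unique(y):
        D = X[y == g] - X[y == g].mean(axis=0)
        S_w += D.T @ D
    S_b = S_t - S_w
    U_inv = np.linalg.inv(np.linalg.cholesky(S_w).T)   # U'U = S_w
    L, E = np.linalg.eigh(U_inv.T @ S_b @ U_inv)
    order = np.argsort(L)[::-1]
    return L[order], U_inv @ E[:, order], S_w, S_b

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)
means = np.array([[0.0, 0.0, 0.0], [2.0, 10.0, 0.0], [0.0, 30.0, 1.0]])
scales = np.array([1.0, 10.0, 0.1])              # variables in different "units"
X = means[y] + rng.normal(size=(60, 3)) * scales

L, V, S_w, S_b = lda_extract(X, y)
N, k = len(X), 3
V = V[:, :2]

K = np.sqrt(np.diag(np.diag(S_w))) @ V           # standardized coefficients

# Same thing via the unstandardized coefficients and pooled within-class SDs.
C = np.sqrt(N - k) * V
sd_within = np.sqrt(np.diag(S_w) / (N - k))
K_alt = sd_within[:, None] * C
```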

Pooled within-group correlations ("structure matrix", sometimes called loadings) between variables and discriminants are given by $\bf R= \it {\sqrt{diag \bf (S_w)}}^{-1} \bf S_w V$ , that is, $\bf S_w V$ with each row divided by the square root of the corresponding diagonal element of $\bf S_w$ . Correlations are insensitive to collinearity problems and constitute an alternative (to the coefficients) guide for assessing variables' contributions and for interpreting discriminants.
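The structure matrix can be checked against correlations computed the long way, on data centered within each class. The sketch assumes NumPy, synthetic data, and eigenvectors scaled so that $\bf{V' S_w V = I}$; the closed form used is $\sqrt{diag(\bf S_w)}^{-1} \bf S_w V$, where the square root on the diagonal scaling is what turns covariances into correlations. `lda_extract` is a hypothetical helper for this example.

```python
import numpy as np

def lda_extract(X, y):
    """Hypothetical helper: eigenvalues L, eigenvectors V, and S_w, S_b."""
    Xc = X - X.mean(axis=0)
    S_t = Xc.T @ Xc
    S_w = np.zeros_like(S_t)
    for g in np.unique(y):
        D = X[y == g] - X[y == g].mean(axis=0)
        S_w += D.T @ D
    S_b = S_t - S_w
    U_inv = np.linalg.inv(np.linalg.cholesky(S_w).T)   # U'U = S_w
    L, E = np.linalg.eigh(U_inv.T @ S_b @ U_inv)
    order = np.argsort(L)[::-1]
    return L[order], U_inv @ E[:, order], S_w, S_b

rng = np.random.default_rng(0)                   # arbitrary seed
y = np.repeat([0, 1, 2], 20)
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = means[y] + rng.normal(size=(60, 3)) @ mix

L, V, S_w, S_b = lda_extract(X, y)
V = V[:, :2]
p = X.shape[1]

# Closed form: rows of S_w V divided by sqrt of the diagonal of S_w.
R = np.diag(1.0 / np.sqrt(np.diag(S_w))) @ S_w @ V

# Long way: center every case on its own class centroid, then correlate
# the within-centered variables with the within-centered scores.
X_within = X.copy()
for g in np.unique(y):
    X_within[y == g] -= X[y == g].mean(axis=0)
scores_within = X_within @ V
both = np.hstack([X_within, scores_within])
R_direct = np.corrcoef(both, rowvar=False)[:p, p:]
```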

See the complete output of the extraction phase of the discriminant analysis of iris data here.

Read this nice later answer, which explains the same things as I did here, a bit more formally and in more detail.

This question deals with the issue of standardizing data before doing LDA.

ttnphns

As said in your answer, LDA is primarily used for dimension reduction, but if the purpose is just classification, then we can simply use the Bayes approach, right? But if the purpose is dimension reduction, then we have to take Fisher's approach to find those directions onto which we will project the original input $X$, right?

avocado

Yes. However, the term "Fisher's approach" is ambiguous. It can mean 2 things: 1) LDA (for 2 classes) itself; 2) Fisher's classification functions in LDA.

ttnphns
