1. Introduction

In this tutorial, we’ll delve into the intricacies of Linear Discriminant Analysis (LDA).

2. Dimensionality Reduction Techniques

Dimensionality reduction simplifies datasets by reducing the number of dimensions they contain. Its techniques fall into two categories: unsupervised and supervised.

Unsupervised methods like principal component analysis (PCA) and independent component analysis (ICA) don’t require class labels, offering versatility. Supervised approaches like mixture discriminant analysis (MDA), neural networks (NN), and linear discriminant analysis (LDA) integrate class labels. Our focus is on LDA.

3. What Is Linear Discriminant Analysis (LDA)?

LDA is a powerful dimensionality reduction technique. It seeks to transform our data into a lower-dimensional space, enhancing class separability while minimizing within-class variance:

We observe three data clusters: class A, class B, and class C. Projecting along dimensional space 1 reduces within-class variance but loses between-class variance for classes A and B. Dimensional space 2 maintains both variances. Thus, we discard space 1 and use space 2 for LDA.

3.1. Steps to Performing LDA

In tackling dimensionality reduction with LDA, we divide the process into three main steps.

  • First, we compute the between-class variance, i.e., the distance between the class means, which measures class separability and which we want to maximize
  • Second, we compute the within-class variance, i.e., the distance between the data points of each class and their class mean, which we want to minimize
  • Lastly, we use these two quantities to construct an appropriate lower-dimensional space

The figure below illustrates these steps visually:

This newly constructed space simplifies our data and strengthens its power to distinguish between classes.

3.2. Approaches to LDA

Exploring LDA, we encounter class-dependent and class-independent approaches. In class-dependent LDA, each class gets its own lower-dimensional space. Conversely, in class-independent LDA, all classes are projected onto one shared lower-dimensional space.

4. Mathematical Foundations of LDA

In this section, we discuss the mathematical process for performing LDA.

4.1. Calculating the Between-class Variance

To calculate the between-class variance S_B, we take the following steps:

We first calculate the separation distance (m_i - m) as follows:

(1)   \begin{equation*}(m_i-m)^2 = (W^Tu_i - W^Tu)^2 = W^T(u_i - u)(u_i - u)^TW\end{equation*}

where m_i represents the projection of the mean of the i^{th} class, which we calculate as follows:

m_i = W^Tu_i.

m is the projection of the total mean of classes which we calculate as follows:

m = W^Tu,

where W represents the transformation matrix of LDA, u_i represents the mean of the i^{th} class and we calculate it as follows,

u_i = \frac{1}{n_i}\sum_{x_i\in w_i}x_i.

u is the total mean of all classes which we calculate as follows:

u = \frac{1}{N}\sum_{i=1}^{N}x_i = \sum_{i=1}^{c}\frac{n_i}{N}u_i,

where N is the total number of samples and c is the total number of classes.

The term (u_i - u)(u_i - u)^T in equation 1 represents the between-class variance S_{Bi} of the i^{th} class.

We substitute S_{Bi} into equation 1 as follows:

(2)   \begin{equation*}(m_i-m)^2 = W^TS_{Bi}W\end{equation*}

We now obtain the total between-class variance S_B as follows:

(3)   \begin{equation*}S_B = \sum_{i=1}^{c}n_{i}S_{Bi}\end{equation*}
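To make this concrete, here is a minimal NumPy sketch of the between-class scatter computation. The function name between_class_scatter and the inputs X (an N x D sample matrix) and y (the class labels) are our own illustrative choices, not part of any standard API:

```python
import numpy as np

def between_class_scatter(X, y):
    """S_B = sum_i n_i (u_i - u)(u_i - u)^T for samples X (N x D) with labels y."""
    D = X.shape[1]
    u = X.mean(axis=0)                           # total mean of all samples
    S_B = np.zeros((D, D))
    for label in np.unique(y):
        X_i = X[y == label]                      # samples of the i-th class
        diff = (X_i.mean(axis=0) - u).reshape(-1, 1)
        S_B += X_i.shape[0] * (diff @ diff.T)    # n_i * S_Bi
    return S_B
```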

4.2. Calculating the Within-class Variance

To calculate the within-class variance S_W, we take the following steps:

The within-class variance of the i^{th} class (S_{Wi}) measures how far the samples of that class scatter around its mean.

We then search for a lower-dimensional space to minimize the difference between the projected mean (m_i) and the projected samples of each class (W^Tx_i), which is the within-class variance.

We capture these steps with the equations as follows:

(4)   \begin{equation*}\sum_{x_i\in w_i}(W^Tx_i - m_i)^2 = \sum_{x_i\in w_i}W^T(x_i - u_i)(x_i - u_i)^TW = W^TS_{Wi}W\end{equation*}

From equation 4, we calculate the within-class variance as follows:

S_{Wi} = d^T_i * d_i = \sum_{k=1}^{n_i}(x_{ki} - u_i)(x_{ki} - u_i)^T,

where x_{ki} represents the k^{th} sample in the i^{th} class and d_i is the mean-centred data of the i^{th} class, i.e.

d_i = w_i - u_i = \{x_{ki}\}_{k=1}^{n_i} - u_i.

To calculate the total within-class variance of our data, we sum up the within-class variances of all classes present in our data as follows:

S_W = \sum_{i=1}^cS_{Wi} = \sum_{x_i\in w_1}(x_i - u_1)(x_i - u_1)^T + \sum_{x_i\in w_2}(x_i - u_2)(x_i - u_2)^T  + … + \sum_{x_i\in w_c}(x_i - u_c)(x_i - u_c)^T.
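Analogously, here is a short NumPy sketch of the total within-class scatter (again, the function and variable names are ours):

```python
import numpy as np

def within_class_scatter(X, y):
    """S_W = sum_i S_Wi, where S_Wi = d_i^T d_i and d_i is the mean-centred class data."""
    D = X.shape[1]
    S_W = np.zeros((D, D))
    for label in np.unique(y):
        X_i = X[y == label]
        d_i = X_i - X_i.mean(axis=0)   # centre the samples of class i around u_i
        S_W += d_i.T @ d_i             # sum of (x_ki - u_i)(x_ki - u_i)^T
    return S_W
```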

4.3. Constructing the Lower Dimensional Space: Fisher’s Criterion

To transform the higher-dimensional features of our data into a lower-dimensional space, we maximize the ratio of S_B to S_W, which guarantees maximum class separability. This ratio is Fisher’s criterion J(W), which we write this way:

(5)   \begin{equation*}J(W) = \frac{W^TS_BW}{W^TS_WW}\end{equation*}

To calculate the transformation matrix W, we transform equation 5 into equation 6 as follows:

(6)   \begin{equation*}S_BW = \lambda S_WW\end{equation*}

where \lambda represents the eigenvalues of the transformation matrix W.

Next, we transform equation 6 into equation 7 as follows:

(7)   \begin{equation*}W = S_W^{-1}S_B\end{equation*}

We can then calculate the eigenvalues (\lambda = \{\lambda_1, \lambda_2, \dots, \lambda_M\}) and their corresponding eigenvectors (V = \{v_1, v_2, \dots, v_M\}) of the matrix W given in equation 7.
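A minimal sketch of this step, assuming S_B and S_W have been computed as in the earlier snippets (the function name is our own):

```python
import numpy as np

def lda_eigendecomposition(S_W, S_B):
    """Eigenvalues and eigenvectors of W = S_W^{-1} S_B from equation 7."""
    W = np.linalg.solve(S_W, S_B)        # avoids forming the explicit inverse
    eigvals, eigvecs = np.linalg.eig(W)  # W is not symmetric in general
    return eigvals.real, eigvecs.real    # discard round-off imaginary parts
```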

4.4. The Role of the Eigenvalues and Eigenvectors

In LDA, the eigenvectors determine the directions of the new space, while the corresponding eigenvalues indicate how discriminative those directions are. We select the k eigenvectors with the highest eigenvalues to form a lower-dimensional space V_k and discard the rest. This reduces our original data X from dimension M to dimension k. We obtain the projected data Y as:

(8)   \begin{equation*}Y = XV_k\end{equation*}
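Continuing the sketch, the selection of the top-k eigenvectors and the projection of equation 8 might look as follows (the function name and the choice of k are ours):

```python
import numpy as np

def project_onto_lda_space(X, eigvals, eigvecs, k):
    """Keep the k eigenvectors with the largest eigenvalues and project X onto them."""
    order = np.argsort(eigvals)[::-1]    # eigenvalue indices in descending order
    V_k = eigvecs[:, order[:k]]          # M x k matrix of leading eigenvectors
    return X @ V_k                       # Y = X V_k, an N x k matrix
```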

5. Class-independent and Class-dependent LDA Approach

Here, we describe the slight difference between the class-independent and class-dependent LDA approaches and also provide the algorithms for both.

5.1. Class-independent LDA Approach

In the class-independent method, we use only one lower-dimensional space for all classes. Thus, we calculate a single transformation matrix W for all classes and project the samples of every class onto the selected eigenvectors.

The algorithm for the class-independent LDA approach therefore computes the class means and the total mean, builds S_B and S_W once for the whole dataset, finds the eigenvalues and eigenvectors of W = S_W^{-1}S_B, keeps the k eigenvectors with the largest eigenvalues, and projects all samples onto them.

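As an illustration, here is a compact NumPy sketch of the whole class-independent procedure; the function name and signature are our own, and this is only a didactic implementation, not an optimized one:

```python
import numpy as np

def lda_class_independent(X, y, k):
    """Project all classes onto one shared k-dimensional LDA space."""
    D = X.shape[1]
    u = X.mean(axis=0)
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for label in np.unique(y):
        X_i = X[y == label]
        u_i = X_i.mean(axis=0)
        diff = (u_i - u).reshape(-1, 1)
        S_B += X_i.shape[0] * (diff @ diff.T)   # between-class scatter
        d_i = X_i - u_i
        S_W += d_i.T @ d_i                      # within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    V_k = eigvecs.real[:, order[:k]]            # top-k eigenvectors
    return X @ V_k                              # all samples share one space
```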

5.2. Class-dependent LDA Approach

In class-dependent LDA, we compute distinct lower-dimensional spaces for each class using W_i = S_{W_i}^{-1}S_B. We find eigenvalues and eigenvectors for each W_i separately, then project samples onto their respective eigenvectors.

The algorithm for the class-dependent LDA approach is analogous, except that the within-class scatter S_{W_i}, the transformation matrix W_i, and its eigendecomposition are computed separately for each class, and the samples of each class are projected onto that class's own selected eigenvectors.

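A corresponding sketch of the class-dependent procedure (again with our own illustrative names); it returns one projection per class rather than a single shared one:

```python
import numpy as np

def lda_class_dependent(X, y, k):
    """Project each class onto its own k-dimensional LDA space."""
    D = X.shape[1]
    u = X.mean(axis=0)
    labels = np.unique(y)
    S_B = np.zeros((D, D))
    for label in labels:                        # shared between-class scatter
        X_i = X[y == label]
        diff = (X_i.mean(axis=0) - u).reshape(-1, 1)
        S_B += X_i.shape[0] * (diff @ diff.T)
    projections = {}
    for label in labels:                        # per-class W_i = S_Wi^{-1} S_B
        X_i = X[y == label]
        d_i = X_i - X_i.mean(axis=0)
        S_Wi = d_i.T @ d_i
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_Wi, S_B))
        order = np.argsort(eigvals.real)[::-1]
        V_k = eigvecs.real[:, order[:k]]
        projections[label] = X_i @ V_k          # each class uses its own space
    return projections
```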

6. Computational Complexity of LDA

In this section, we provide the computational complexities of the two LDA approaches.

6.1. Computational Complexity for Class-independent LDA

In analyzing the computational complexity of class-independent LDA, we consider the following steps:

  • Computing the mean of each class: O(DN + cD)
  • Calculating the total mean: O(ND + D)
  • Computing the within-class scatter: O(ND + ND^2)
  • Calculating the between-class scatter: O(cD + cD^2)
  • Finding the eigenvalues and eigenvectors: O(D^3)
  • Sorting the eigenvectors: O(D \log D)
  • Selecting k eigenvectors: O(Dk)
  • Projecting the data onto the selected eigenvectors: O(DkN)

Overall, class-independent LDA has complexity O(ND^2) when N > D, and O(D^3) otherwise.

6.2. Computational Complexity for Class-dependent LDA

Class-dependent LDA follows similar steps, but it repeats the scatter computation, eigendecomposition, and projection for each class, which increases the complexity. Overall, class-dependent LDA has complexity O(ND^3) when N > D, and O(cD^3) otherwise, exceeding that of class-independent LDA.

7. Numerical Example

In this section, we present a numerical example that shows, step by step, how to calculate the LDA space and how LDA discriminates between two classes of data using the class-independent and class-dependent approaches.

The first class w_1 is a 5 x 2 matrix with the number of samples n_1 = 5. The second class w_2 is a 6 x 2 matrix with n_2 = 6. Each sample in both classes has two features (i.e., D = 2) as follows:

w_1 = \begin{bmatrix} 2.3 & 1.2 \\ 0.8 & 2.0 \\ 5.1 & 1.5 \\ 6.2 & 4.4 \\ 3.8 & 0.9 \end{bmatrix}, and, w_2 = \begin{bmatrix} 6.4 & 1.5 \\ 9.3 & 4.7 \\ 3.2 & 5.1 \\ 0.9 & 8.2 \\ 7.2 & 1.4 \\ 2.2 & 5.8 \end{bmatrix}.

First, we calculate the mean u_j of each class, and then the total mean u over all samples. The results are as follows:

u_1 = \begin{bmatrix} 3.64 & 2.00 \end{bmatrix}, u_2 = \begin{bmatrix} 4.87 & 4.45 \end{bmatrix}, and u = \begin{bmatrix} 4.31 & 3.34 \end{bmatrix}.
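These values are easy to verify with a few lines of NumPy; the array names below are ours, chosen to mirror the notation of the example:

```python
import numpy as np

w1 = np.array([[2.3, 1.2], [0.8, 2.0], [5.1, 1.5], [6.2, 4.4], [3.8, 0.9]])
w2 = np.array([[6.4, 1.5], [9.3, 4.7], [3.2, 5.1], [0.9, 8.2], [7.2, 1.4], [2.2, 5.8]])

u1 = w1.mean(axis=0)                  # class mean of w_1: [3.64, 2.00]
u2 = w2.mean(axis=0)                  # class mean of w_2: approx. [4.87, 4.45]
u = np.vstack([w1, w2]).mean(axis=0)  # total mean of all 11 samples: approx. [4.31, 3.34]
```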

Next, we calculate the between-class variance S_{B_i} of each class, and the total between-class variance S_B of the given data, as follows:

S_{B_1} = n_1(u_1 - u)^T(u_1 - u), so that,

S_{B_1}= \begin{bmatrix} 2.24 & 4.47 \\ 4.47 & 8.93 \end{bmatrix}

similarly,

S_{B_2} = n_2(u_2 - u)^T(u_2 - u), so that,

S_{B_2}= \begin{bmatrix} 1.87 & 3.73 \\ 3.73 & 7.44 \end{bmatrix}

S_B becomes,

S_B = S_{B_1} + S_{B_2} = \begin{bmatrix} 4.10 & 8.20 \\ 8.20 & 16.37 \end{bmatrix}.

Next, we calculate the within-class variance S_W. To do this, we mean-centre the data by subtracting the mean of each class from each sample in that class: d_i = w_i - u_i, where d_i is the mean-centred data of class w_i. The values of d_1 and d_2 become:

d_1 = \begin{bmatrix} -1.34 & -0.80 \\ -2.84 & 0.00 \\ 1.46 & -0.50 \\ 2.56 & 2.40 \\ 0.16 & -1.10 \end{bmatrix}, and d_2 = \begin{bmatrix} 1.53 & -2.95 \\ 4.43 & 0.25 \\ -1.67 & 0.65 \\ -3.97 & 3.75 \\ 2.33 & -3.05 \\ -2.67 & 1.35 \end{bmatrix}.

In the next two sub-sections, we’ll present how the two LDA approaches solve this numerical example.

7.1. Class-independent LDA Approach

In this approach, after centring the data, we calculate the within-class variance of each class as S_{W_i} = d_i^T d_i. The values of S_{W_1} and S_{W_2} are as follows:

S_{W_1} = \begin{bmatrix} 18.57 & 6.31 \\ 6.31 & 7.86 \end{bmatrix}, and S_{W_2} = \begin{bmatrix} 53.07 & -30.09 \\ -30.09 & 34.38 \end{bmatrix}.

S_W becomes,

S_W = S_{W_1} + S_{W_2}.

S_W = \begin{bmatrix} 71.65 & -23.78 \\ -23.78 & 42.24 \end{bmatrix}.

Next, we calculate the transformation matrix W as follows: W = S_W^{-1}S_B. The values of S_W^{-1} are \begin{bmatrix}  0.02 & 0.01 \\ 0.01 & 0.03 \end{bmatrix}, so the transformation matrix becomes:

W = \begin{bmatrix}  0.15 & 0.30 \\ 0.28 & 0.56 \end{bmatrix}.

We then calculate the eigenvalues \lambda and eigenvectors V from W. Their values are as follows:

\lambda = \begin{bmatrix} 0.00 & 0.00 \\ 0.00 & 0.71 \end{bmatrix}, and V = \begin{bmatrix} -0.89 & -0.47 \\ 0.45 & -0.88 \end{bmatrix}

Next, we project w_1 and w_2 onto the first column vector V^{[1]} and the second column vector V^{[2]} of the matrix V to investigate which of the columns is more robust in discriminating the data.

Using V^{[1]}, we project w_1 on the lower dimensional space as follows:

y_1 = w_1V^{[1]},

y_1 = \begin{bmatrix} 2.3 & 1.2 \\ 0.8 & 2.0 \\ 5.1 & 1.5 \\ 6.2 & 4.4 \\ 3.8 & 0.9 \end{bmatrix} \begin{bmatrix} -0.89 \\ 0.45 \end{bmatrix} = \begin{bmatrix} -1.52 \\ 0.18 \\ -3.89 \\ -3.57 \\ -3.00 \end{bmatrix}

Using the same eigenvector V^{[1]}, we project w_2 on the lower dimensional space as follows:

y_2 = w_2V^{[1]},

y_2 = \begin{bmatrix} 6.4 & 1.5 \\ 9.3 & 4.7 \\ 3.2 & 5.1 \\ 0.9 & 8.2 \\ 7.2 & 1.4 \\ 2.2 & 5.8 \end{bmatrix} \begin{bmatrix} -0.89 \\ 0.45 \end{bmatrix} = \begin{bmatrix} -5.05 \\ -6.21 \\ -0.58 \\ 2.87 \\ -5.81 \\ 0.63 \end{bmatrix}.

Next, we project w_1 and w_2 to eigenvector V^{[2]} as follows:

y_1 = w_1V^{[2]},

y_1 = \begin{bmatrix} 2.3 & 1.2 \\ 0.8 & 2.0 \\ 5.1 & 1.5 \\ 6.2 & 4.4 \\ 3.8 & 0.9 \end{bmatrix} \begin{bmatrix} -0.47 \\ -0.88 \end{bmatrix} = \begin{bmatrix} -2.15 \\ -2.14 \\ -3.74 \\ -6.81 \\ -2.60 \end{bmatrix}

and

y_2 = w_2V^{[2]},

y_2 = \begin{bmatrix} 6.4 & 1.5 \\ 9.3 & 4.7 \\ 3.2 & 5.1 \\ 0.9 & 8.2 \\ 7.2 & 1.4 \\ 2.2 & 5.8 \end{bmatrix}\begin{bmatrix} -0.47 \\ -0.88 \end{bmatrix} = \begin{bmatrix} -4.35 \\ -8.54 \\ -6.01 \\ -7.65 \\ -4.64 \\ -6.15 \end{bmatrix}.

From our results, we observe that V^{[2]} is more robust in discriminating the classes w_1 and w_2 than V^{[1]}. This is because \lambda^{[2]} is larger than \lambda^{[1]}. Hence, we choose  V^{[2]} to discriminate w_1 and w_2 instead of V^{[1]}.
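If we want to double-check the class-independent computations above, a short continuation of the earlier NumPy sketch reproduces them (up to rounding, and up to the arbitrary sign of the eigenvectors, which may flip the signs of the projected values):

```python
import numpy as np

# Continuing with w1, w2, u1, u2, and u defined above
n1, n2 = len(w1), len(w2)

S_B = n1 * np.outer(u1 - u, u1 - u) + n2 * np.outer(u2 - u, u2 - u)
S_W = (w1 - u1).T @ (w1 - u1) + (w2 - u2).T @ (w2 - u2)

W = np.linalg.solve(S_W, S_B)        # S_W^{-1} S_B
eigvals, eigvecs = np.linalg.eig(W)
v = eigvecs[:, np.argmax(eigvals)]   # eigenvector with the largest eigenvalue
y1 = w1 @ v                          # projection of w_1
y2 = w2 @ v                          # projection of w_2
```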

7.2. Class-dependent LDA Approach

In this approach, we aim to calculate a separate transformation matrix W_i for each class w_i. The within-class variance of each class is the same as in the class-independent approach, so we calculate W_i of each class as W_i = S_{W_i}^{-1}S_B. The values of W_1 and W_2 become:

W_1 = S_{W_1}^{-1}S_B

W_1 = \begin{bmatrix} 18.57 & 6.31 \\ 6.31 & 7.86 \end{bmatrix}^{-1}\begin{bmatrix} 4.10 & 8.20 \\ 8.20 & 16.37 \end{bmatrix} = \begin{bmatrix} -0.18 & -0.37 \\ 1.19 & 2.38 \end{bmatrix}, and

W_2 = S_{W_2}^{-1}S_B

W_2 = \begin{bmatrix} 53.07 & -30.09 \\ -30.09 & 34.38 \end{bmatrix}^{-1}\begin{bmatrix} 4.10 & 8.20 \\ 8.20 & 16.37 \end{bmatrix} = \begin{bmatrix} 0.42 & 0.84 \\ 0.61 & 1.21 \end{bmatrix}.

Next, we calculate the eigenvalues \lambda_i and the corresponding eigenvectors V_i of each transformation matrix W_i.

The values of \lambda_{w1} and V_{w1} of W_1 for w_1 are as follows:

\lambda_{w1} = \begin{bmatrix} 0.00 & 0.00 \\ 0.00 & 2.19 \end{bmatrix}, and the corresponding eigenvectors are V_{w1} = \begin{bmatrix} -0.90 & 0.15 \\ 0.45 & -0.99 \end{bmatrix}

Similarly, the values of the eigenvalues \lambda_{w2} and the corresponding eigenvectors V_{w2} of W_2 for the second class w_2 are as follows:

\lambda_{w2} = \begin{bmatrix} 0.00 & 0.00 \\ 0.00 & 1.64 \end{bmatrix}, and the corresponding eigenvectors are V_{w2} = \begin{bmatrix} -0.90 & -0.57 \\ 0.45 & -0.82 \end{bmatrix}.

From the results above, we observe that V_{w1}^{[2]} has a larger corresponding eigenvalue than V_{w1}^{[1]}. This implies that V_{w1}^{[2]} discriminates w_1 better than V_{w1}^{[1]}, so we choose V_{w1}^{[2]} as the lower-dimensional space onto which we project w_1:

y_1 = w_1V_{w1}^{[2]}

y_1 = \begin{bmatrix} 2.3 & 1.2 \\ 0.8 & 2.0 \\ 5.1 & 1.5 \\ 6.2 & 4.4 \\ 3.8 & 0.9 \end{bmatrix}\begin{bmatrix} 0.15 \\ -0.99 \end{bmatrix} = \begin{bmatrix} -0.84 \\ -1.85 \\ -0.71 \\ -3.41 \\ -0.31 \end{bmatrix}.

For the second class w_2, we likewise see that the second column eigenvector V_{w2}^{[2]} of V_{w2} has a higher eigenvalue than the first column eigenvector V_{w2}^{[1]}, so V_{w2}^{[2]} discriminates better than V_{w2}^{[1]}. We obtain the projected data y_2 for w_2 as follows:

y_2 = w_2V_{w2}^{[2]}

y_2 = \begin{bmatrix} 6.4 & 1.5 \\ 9.3 & 4.7 \\ 3.2 & 5.1 \\ 0.9 & 8.2 \\ 7.2 & 1.4 \\ 2.2 & 5.8 \end{bmatrix}\begin{bmatrix} -0.57 \\ -0.82 \end{bmatrix} = \begin{bmatrix} -4.90 \\ -9.16 \\ -6.01 \\ -7.25 \\ -5.26 \\ -6.02 \end{bmatrix}.
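The class-dependent results can be checked the same way, continuing the previous sketch (the same caveats about rounding and eigenvector signs apply):

```python
import numpy as np

# Continuing with w1, w2, u1, u2, and S_B defined above
S_W1 = (w1 - u1).T @ (w1 - u1)
S_W2 = (w2 - u2).T @ (w2 - u2)

W1 = np.linalg.solve(S_W1, S_B)        # S_W1^{-1} S_B
W2 = np.linalg.solve(S_W2, S_B)        # S_W2^{-1} S_B

vals1, vecs1 = np.linalg.eig(W1)
vals2, vecs2 = np.linalg.eig(W2)

y1 = w1 @ vecs1[:, np.argmax(vals1)]   # project w_1 onto its own leading eigenvector
y2 = w2 @ vecs2[:, np.argmax(vals2)]   # project w_2 onto its own leading eigenvector
```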

In the next sub-section, we present the visual illustration of the performance of LDA on the two classes of data w_1 and w_2 and then draw our conclusions.

7.3. Performance Analysis of Class-independent and Class-dependent LDA Approach

First, let us consider the class-independent LDA approach.

The next figures illustrate the probability density function (pdf) graphs of the projected data y_1 and y_2 on the two eigenvectors V^{[1]} and V^{[2]} of matrix V:

A comparison of the two figures reveals the following:

  • V^{[2]} (second figure) discriminates the data of each class better than V^{[1]} (first figure). This means that V^{[2]} maximizes the between-class variance S_B more than V^{[1]} does
  • V^{[2]} minimizes the within-class variance S_{W_i} of both classes more than V^{[1]} does
  • The outlier of y_1 visible in the second figure can be handled with standard outlier-handling techniques

Let us investigate the class-dependent LDA approach:

We observe that by projecting w_1 onto V_{w1}^{[2]} (the leading eigenvector of W_1) and w_2 onto V_{w2}^{[2]} (the leading eigenvector of W_2), we were able to discriminate the data efficiently. These eigenvectors achieve this by maximizing S_B and minimizing S_W of the data.

8. Conclusion

In our article, we explored linear discriminant analysis (LDA) through class-independent and class-dependent approaches. Our analysis revealed that while the class-dependent method discriminates the classes well, it demands more computational time than the class-independent approach. Consequently, the class-independent approach is commonly favoured as the standard method for LDA.