1. Introduction

Dimensionality reduction methods transform high-dimensional data into lower-dimensional spaces. T-distributed Stochastic Neighbor Embedding (t-SNE) is one of the most popular techniques for doing so.

In this tutorial, we’ll review t-SNE and how to interpret t-SNE plots.

2. High Dimensional Data

High-dimensional data refers to any dataset where the number of features is comparable to or exceeds the number of elements (observations). 

Let’s consider a healthcare dataset consisting of patient records. The observations in this context correspond to patients’ visits to the healthcare facility. Each visit is a vector of features specifying the patient’s health history, previous diagnoses, age, height, weight, race, vital signs, medications, current symptoms, and various diagnostic test results.

The goal is to train a classifier for diagnosing patients. However, the medical facility that compiles the dataset may have just a few patients who consent to having their data recorded. As a result, the number of patients (observations) in the dataset may be smaller than the number of recorded attributes.

2.1. Dimensionality Reduction

High-dimensional data complicates modeling and analysis. In machine learning, it gives rise to the curse of dimensionality: as the number of features grows, the feature space becomes increasingly sparse, so models need far more observations to generalize well. The most effective remedy is to reduce the number of dimensions in the data, and this is where dimensionality-reduction techniques come into play.

The purpose of these techniques is to transform the data into a space with fewer features while preserving its essential properties. Consequently, the transformed dataset is easier to interpret and use for downstream analysis.

3. How Does t-SNE Work?

t-SNE, proposed by Laurens van der Maaten and Geoffrey Hinton, is a non-linear method that reduces high-dimensional data to a lower-dimensional space, typically of two or three dimensions. The main idea is that points that are close in the original space should also be close in the new space, while points that are far apart in the original space should remain distant.

In t-SNE, we start by modeling the pairwise distances between the data points with a probability distribution. Let X = \{x_1, x_2, \ldots, x_n\} be the dataset in the original high-dimensional feature space and Y = \{y_1, y_2, \ldots, y_n\} its mapping into a low-dimensional feature space.

For x_i, x_j in X, we compute p_{j|i}, the probability that x_j is a neighbor of x_i under a Gaussian distribution centered at x_i. The bandwidth \sigma_i is chosen separately for each point so that the resulting neighbor distribution matches a user-specified perplexity:

    [ p_{j|i} = \frac{\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}\right)} ]
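To make the formula concrete, here’s a minimal NumPy sketch that computes the conditional probabilities p_{j|i} with a single fixed bandwidth for all points. The fixed bandwidth is a simplifying assumption: real t-SNE implementations search for a separate \sigma_i per point to match the target perplexity:

    import numpy as np

    def conditional_probabilities(X, sigma=1.0):
        """Compute p_{j|i} for all pairs, using one fixed bandwidth sigma."""
        # Squared Euclidean distances between all pairs of points
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        # Unnormalized Gaussian affinities
        affinities = np.exp(-sq_dists / (2 * sigma ** 2))
        np.fill_diagonal(affinities, 0.0)  # a point is not its own neighbor
        # Normalize each row so that the probabilities p_{j|i} sum to 1 over j
        return affinities / affinities.sum(axis=1, keepdims=True)

    X = np.random.rand(5, 3)  # 5 points in 3 dimensions
    P = conditional_probabilities(X)
    print(P.sum(axis=1))      # each row sums to 1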

In the lower-dimensional space Y, the distances are modeled with a Student’s t-distribution with one degree of freedom. Its heavy tails allow moderately distant points to be placed farther apart in the map, which alleviates the crowding of points in the low-dimensional space. For a given mapping, we can compute q_{j|i}, the probability that y_j is a neighbor of y_i under the t-distribution:

    [ q_{j|i} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \|y_i - y_k\|^2\right)^{-1}} ]

t-SNE uses the Kullback-Leibler divergence (KLD) as its cost function and minimizes it with gradient descent, matching the neighbor distributions in the transformed space to those in the original space. Minimizing with respect to y_1, y_2, \ldots, y_n, the cost sums the divergence over all points:

    [ C = \sum_i \text{KL}(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} ]
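Continuing the sketch from above, the low-dimensional affinities q_{j|i} and the KLD cost translate directly into NumPy. This is an illustrative sketch, not the optimized gradient-descent loop of a real implementation:

    import numpy as np

    def low_dim_affinities(Y):
        """Compute q_{j|i} for all pairs using the Student-t kernel."""
        sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        affinities = 1.0 / (1.0 + sq_dists)  # heavy-tailed t kernel
        np.fill_diagonal(affinities, 0.0)
        return affinities / affinities.sum(axis=1, keepdims=True)

    def kl_cost(P, Q, eps=1e-12):
        """Sum of KL(P_i || Q_i) over all points i; eps avoids log(0)."""
        return np.sum(P * np.log((P + eps) / (Q + eps)))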

After completing the process, we can visualize and analyze the reduced dataset to reveal its properties.

4. Executing t-SNE

So, to run t-SNE on a dataset, we follow these steps.

4.1. Step 1: Data Preparation

First, the data should be cleaned and preprocessed to make it suitable for further analysis. This typically involves missing data imputation, removing outliers, and formatting data types.

Additionally, the data should be normalized if the features are of different ranges to ensure that all features are on the same scale.
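For example, standardization puts every feature on a comparable scale. Here’s a minimal sketch with scikit-learn’s StandardScaler, which is one common choice among several scaling methods:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(100, 50)  # 100 observations, 50 features
    X[:, 0] *= 1000              # one feature on a much larger scale

    X_scaled = StandardScaler().fit_transform(X)
    print(X_scaled.mean(axis=0).round(2))  # each feature now has mean ~0
    print(X_scaled.std(axis=0).round(2))   # and standard deviation ~1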

4.2. Step 2: Select the Number of Components

The second step is to select the number of dimensions of the new space into which we’ll map the data.

This parameter is typically set to two or three dimensions in t-SNE.

4.3. Step 3: Set and Tune Hyperparameters

The next step is to set t-SNE’s hyperparameters. The three main hyperparameters are the perplexity, the learning rate, and the number of iterations.

Perplexity roughly determines the number of nearest neighbors each point takes into account, the learning rate determines the step size for optimizing the positions of data points in the low-dimensional space, and the number of iterations bounds how long the optimization runs.
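Putting steps 2 and 3 together, here’s a minimal sketch with scikit-learn’s TSNE; the hyperparameter values below are illustrative defaults, not tuned recommendations:

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(250, 250)  # 250 observations, 250 features

    tsne = TSNE(
        n_components=2,           # step 2: map into two dimensions
        perplexity=30.0,          # effective number of neighbors per point
        learning_rate=200.0,      # gradient-descent step size
        random_state=42,          # make the embedding reproducible
    )
    Y = tsne.fit_transform(X)     # Y has shape (250, 2)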

4.4. Step 4: Visualize the Results

After executing the algorithm, the last step is visualizing the transformed data. We plot the transformed representation y_1, y_2, \ldots, y_n of the data.

Visualization aids in identifying the inherent patterns in the data that we can’t spot in high-dimensional spaces.
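Continuing from the embedding Y computed above, a minimal plotting sketch with matplotlib:

    import matplotlib.pyplot as plt

    plt.scatter(Y[:, 0], Y[:, 1], s=10)
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.title("t-SNE embedding of the dataset")
    plt.show()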

5. How Do We Interpret a t-SNE Plot?

A t-SNE visualization shows the general structure of the data while emphasizing local patterns.

The first step in interpreting the plot is to inspect whether there are clusters in the visualization. Clusters indicate groups of data points that share similar characteristics.

Next, we can examine the separation between the clusters. If two clusters overlap, their data points share strong similarities. On the other hand, if the clusters are clearly distinct, the dataset contains unambiguously differing groups of data points.

Similarly, in the t-SNE plot, we can examine the data distribution. Data points clustered closely together typically signify strong correlations. Conversely, when data points scatter randomly across the 2D or 3D space, it suggests a lack of strong relationships within the data.

5.1. An Example

To illustrate the interpretation of a t-SNE plot, let’s generate a random dataset with 250 features and 250 data points. After applying t-SNE, we visualize the transformed data in a 2D space:

a t-SNE plot on a random dataset

The t-SNE plot shows no distinct clusters within the data. The data points are scattered across the 2D space, correctly suggesting that the dataset contains no strong relationships or groupings.

Similarly, let’s consider a second dataset with 250 features and 250 data points from two distinct classes. The classes come from two Gaussian distributions having different means. Applying t-SNE to this data and visualizing it in a 2D space, we get:

A t-SNE plot over a dataset with two distinct classes

Contrary to the first plot, the t-SNE plot for the second dataset shows two distinct clusters corresponding to the two classes. The two clusters don’t overlap, indicating that the two groups of points differ unambiguously.

Finally, the cluster structure shows that the data points within each class are similar, while the points from one class are distinct from those of the other.
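Here’s a minimal sketch that reproduces the second experiment; the means and standard deviations of the two Gaussians are assumptions chosen for illustration:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(42)

    # Two classes of 125 points each, drawn from Gaussians with different means
    class_a = rng.normal(loc=0.0, scale=1.0, size=(125, 250))
    class_b = rng.normal(loc=5.0, scale=1.0, size=(125, 250))
    X = np.vstack([class_a, class_b])         # 250 points, 250 features
    labels = np.array([0] * 125 + [1] * 125)

    Y = TSNE(n_components=2, perplexity=30.0, random_state=42).fit_transform(X)

    plt.scatter(Y[:, 0], Y[:, 1], c=labels, s=10)
    plt.title("t-SNE on a dataset with two distinct classes")
    plt.show()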

6. Conclusion

In this article, we covered the basics of t-SNE and how to interpret t-SNE plots.

Firstly, t-SNE transforms data with many dimensions into data with fewer dimensions. Secondly, a t-SNE plot shows the general structure of the data, highlighting the inherent relationships between the features or data points, or the lack thereof.