1. Introduction
In this tutorial, we’ll learn why applying Euclidean metrics for high-dimensional datasets may be a bad idea and what other metrics we can use instead.
2. Euclidean Distances in Low Dimensions
We define the Euclidean distance between two vectors as the L2-norm of their difference. This metric is commonly used to calculate the distance between pairs of points. Therefore, if we work with points on a 2D-plane, and , their Euclidean distance is:
(1)
In an -dimensional space, and will have components each: and . The Euclidean distance between and in dimensions is:
(2)
2.1. Euclidean Distance and Multidimensional Datasets
However, this metric isn’t always the best choice when working with datasets, especially when they have many dimensions.
Generally, we want a distance metric in line with the geometric properties of the points or observations. For example, for two pairs of points centrally symmetric about the origin and at a similar distance, we’d like their pair-wise distance to scale up similarly among all pairs as we increase the number of dimensions.
For example, let’s say we select two pairs of objects (A, B) and (C, D) because they are both centrally symmetric:
We’d expect the distances between A and B, on one hand, and C and D, on the other, to increase similarly as the number of dimensions of the cube increases. However, as we’ll see shortly, this property is lost rapidly for the Euclidean metric.
3. Squares, Cubes, and Hypercubes
We can see why this is the case by considering the vector set corresponding to the vertices of the -cube centered at the origin of an -dimensional space. Let every cube’s side have length 2. For a 2D plane, the -cube is a square, like the one we saw in the previous section.
For a 3D space, we have a regular cube:
Lastly, for spaces with more than three dimensions, these objects take the name of hyper-cubes.
3.1. How Do Distances Vary as Dimensionality Increases?
Let’s now consider the distance between the two pairs of points: the ones lying at opposite corners of the hypercube () and the ones lying at opposite faces (). What happens to those distances as grows?
We expect that increasing the dimensions of the vector space will increase the distances between points in a pair regardless of their positions in the cube.
Let’s check if this is the case. For the opposite corners, the Euclidean distance of and is a square function of :
(3)
What happens to the distance between points and on opposite faces? Their Euclidean distance as a function of is a constant:
(4)
The distance between two points on opposite faces remains constant as the number of dimensions increases.
So, how the distance between pairs of objects changes as grows depends on the region of the vector space in which they are, even though we’d expect the distances of similar objects to change similarly. Therefore, when working with high dimensions, the intuitive notion of “pairs of objects that have been selected with similar criteria behave similarly” isn’t always valid.
4. Should We Use the Euclidean Metric in High Dimensions?
In multidimensional datasets, it may matter less how far each observation is from the origin than the region of the vector space it is in.
To address this problem, we can use alternative metrics, such as cosine similarity, that work precisely upon these assumptions. Cosine similarity, in particular, only considers the angle an observation has with the bases of its vector space. The comparison between Euclidean distance and cosine similarity shows that the latter is more suitable for working with clusterization algorithms, such as K-Nearest Neighbors when the dataset has more than a few features. Additionally, we might consider using other metrics, such as the Haversine distance, when working with vectors that intrinsically represent angles, such as coordinates on Earth’s surface.
Ultimately, the metric to use depends upon the problem we’re solving, so it’s always a good idea to keep many metrics in our toolbox and choose one appropriate for the given task.
5. Conclusions
In this article, we studied why the Euclidean metric isn’t suitable for high-dimensional datasets and have considered some alternative metrics to use in its place.
In particular, we observed how the Euclidean distances between pairs of points can change differently with increasing dimensionality based on their positions in the vector space. This may not preserve the intuitive notion that points selected similarly, for example, using the criterion of central symmetry, behave similarly as the number of dimensions increases.