1. Overview
The Locality-Sensitive Hashing (LSH) algorithm hashes input items so that similar items have a high probability of being mapped to the same buckets.
In this quick article, we will use the java-lsh library to demonstrate a simple use case of this algorithm.
2. Maven Dependency
To get started we’ll need to add Maven dependency to the java-lsh library:
<dependency>
<groupId>info.debatty</groupId>
<artifactId>java-lsh</artifactId>
<version>0.10</version>
</dependency>
3. Locality-Sensitive Hashing Use Case
LSH has many possible applications, but we will consider one particular example.
Suppose we have a database of documents and want to implement a search engine that will be able to identify similar documents.
We can use LSH as part of this solution:
- Every document can be transformed to a vector of numbers or booleans – for example, we could use the word2vect algorithm to transform words and documents into vectors of numbers
- Once we have a vector representing each document, we can use the LSH algorithm to calculate a hash for each vector, and due to the characteristics of LSH, documents that are presented as similar vectors will have a similar or same hash
- As a result, given a particular document’s vector, we can find N numbers of vectors that have a similar hash and return the corresponding documents to the end user
4. Example
We will be using the java-lsh library to calculate hashes for our input vectors. We won’t be covering the transformation itself, as this is a huge topic beyond the scope of this article.
However, suppose we have three input vectors that are transformed from a set of three documents, presented in a form that can be used as the input for the LSH algorithm:
boolean[] vector1 = new boolean[] {true, true, true, true, true};
boolean[] vector2 = new boolean[] {false, false, false, true, false};
boolean[] vector3 = new boolean[] {false, false, true, true, false};
Note that in a production application, the number of input vectors should be a lot higher to leverage the LSH algorithm, but for the sake of this demonstration, we will stick to three vectors only.
It is important to note that first vector is vastly different from the second and third, whereas the second and third vectors are quite similar to each other.
Let’s create an instance of the LSHMinHash class. We need to pass the size of the input vectors to it – all input vectors should have equal size. We also need to specify how many hash buckets we want and how many stages of computation (iterations) LSH should perform:
int sizeOfVectors = 5;
int numberOfBuckets = 10;
int stages = 4;
LSHMinHash lsh = new LSHMinHash(stages, numberOfBuckets, sizeOfVectors);
We specify that all vectors that will be hashed by the algorithms should be hashed among ten buckets. We also want to have four iterations of LSH for calculating hashes.
To calculate the hash for each vector, we pass the vector to the hash() method:
int[] firstHash = lsh.hash(vector1);
int[] secondHash = lsh.hash(vector2);
int[] thirdHash = lsh.hash(vector3);
System.out.println(Arrays.toString(firstHash));
System.out.println(Arrays.toString(secondHash));
System.out.println(Arrays.toString(thirdHash));
Running that code will result in output similar to:
[0, 0, 1, 0]
[9, 3, 9, 8]
[1, 7, 8, 8]
Looking at each output array, we can see the hash values calculated at each of the four iterations for the corresponding input vector. The first line shows the hash results for the first vector, the second line for the second vector, and third line for the third vector.
After four iterations, the LSH yielded results as we expected – LSH calculated the same hash value (8) for the second and third vectors, which were similar to each other, and a different hash value (0) for the first vector, which was different from the second and third vectors.
LSH is an algorithm that is based on probability, so we cannot be sure that two similar vectors will land in the same hash bucket. Nevertheless, when we have a large enough number of input vectors, the algorithm yields results that will have a high probability for assigning similar vectors to the same buckets.
When we are dealing with massive data sets, LSH can be a handy algorithm.
5. Conclusion
In this quick article, we looked at an application of the Locality-Sensitive Hashing algorithm and showed how to use it with the help of the java-lsh library.
The implementation of all these examples and code snippets can be found in the GitHub project – this is a Maven project, so it should be easy to import and run as it is.