1. Overview
In this tutorial, we’ll explore various algorithms for finding the top 10 search terms in a given dataset.
We’ll review several approaches, including frequency counting, sorting, and data-structure-based methods.
2. Assumptions
Before we delve into the topic, it's important to clarify some fundamental concepts.
2.1. Dataset
A dataset is a collection of search terms to analyze. We can define two types of datasets:
- Finite datasets: they have a fixed number of records
- Data streams: a continuous flow of data generated by various sources
With finite datasets, we can use algorithms that require multiple passes over the data, such as sorting or building hash tables. The dataset size affects both memory usage and processing time.
Data streams can't be stored in full, so algorithms must be fast, scalable, and memory-efficient. They typically produce approximate or estimated results rather than exact ones.
2.2. Storage
With a finite dataset, storing the data is only possible if we have sufficient storage capacity.
However, we may still need to employ data stream algorithms when storage is limited.
Using data stream algorithms, we can handle finite data sets as if they were infinite data streams, processing each data element as it becomes available. This approach allows us to perform real-time data analysis without storing the entire data set in memory or on disk.
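For instance, in Python, we could wrap a finite file in a generator and consume it one record at a time, keeping only the running counts in memory; the file name search_terms.txt is just a placeholder:
from collections import Counter

def stream_terms(path):
    # Yield one search term at a time instead of loading the whole file
    with open(path) as file:
        for line in file:
            term = line.strip()
            if term:
                yield term

# Only the running counts stay in memory, not the raw dataset
counts = Counter(stream_terms("search_terms.txt"))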
3. Data Structures
When performing a top-k search, we can use several data structures to efficiently retrieve the most relevant elements from a collection.
3.1. Hashtable
A hashtable is a data structure that stores key-value pairs in an array-like structure; here, it maps each search term to its frequency.
Once the frequencies are up to date, we can sort them with a sorting algorithm and then select the top k terms with the highest frequency:
algorithm FindTopKMostFrequentWords(hashtable, minFrequency, k):
// INPUT
// hashtable = a hashmap storing the frequency of each search term
// minFrequency = the minimum frequency value to consider
// k = the number of top frequent words to find
// OUTPUT
// top_K = a list of the top k most frequent words
top_K <- []
for key in hashtable.keys():
    freq <- hashtable[key]
    if freq > minFrequency:
        append (key, freq) to top_K
        sort top_K nonincreasingly by frequency
        top_K <- top_K[1:k]
        if length(top_K) = k:
            (minKey, minFrequency) <- the last element of top_K
return top_K
This algorithm maintains a list top_K of the current top k elements and a variable minFrequency that keeps track of the minimum frequency among them.
We iterate through all the keys in the hashtable, and if the frequency associated with a key is greater than minFrequency, we add the key and its frequency to top_K.
We then sort the list in nonincreasing order of frequency and keep only the top k elements.
Finally, once top_K holds k elements, we update minFrequency to be the minimum frequency among them.
Time complexity depends on the hash function used, the number of elements in the table, and the number of collisions.
The average time complexity for inserting, deleting, and searching for an element in a hash table is O(1) in the ideal case, but it can be O(n) in the worst case.
Space complexity is O(n), where n is the number of elements in the table.
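As a quick reference, here's a minimal Python sketch of this hashtable-based approach, using a dictionary counter and heapq.nlargest for the selection; the function name and sample terms are made up for illustration:
from collections import Counter
import heapq

def top_k_most_frequent(search_terms, k=10):
    # Build the hashtable mapping each term to its frequency
    frequency = Counter(search_terms)
    # Keep the k entries with the highest counts
    return heapq.nlargest(k, frequency.items(), key=lambda entry: entry[1])

# Example usage with made-up data
terms = ["java", "python", "java", "go", "python", "java"]
print(top_k_most_frequent(terms, k=2))  # [('java', 3), ('python', 2)]
Using heapq.nlargest avoids sorting the whole table when k is much smaller than the number of distinct terms.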
3.2. Trie Tree
A Trie Tree, also known as a prefix tree, is a tree data structure that stores a collection of strings, such as search terms.
In a Trie Tree, each node represents a single character in a string, and each edge represents a character transition from one node to another.
So the root node represents an empty string, and the leaf nodes represent complete strings.
The algorithm assumes that each node in the trie tree has a value property indicating the character it represents, a frequency property indicating the frequency of the term ending at that node, a children property containing its child nodes, and an isWord flag marking nodes where a complete term ends:
algorithm TraverseTrie(tempList, node, prefix):
// INPUT
// tempList = the list collecting (term, frequency) pairs
// node = the current node in the trie
// prefix = the string built so far
// OUTPUT
// tempList, filled with every complete term in the trie together with its frequency
if node.isWord:
    append (prefix, node.frequency) to tempList
for child in node.children:
    TraverseTrie(tempList, child, prefix + child.value)
So our algorithm will process the trie tree:
algorithm TopKSearchTermsTrieTree(trieTree, k):
// INPUT
// trieTree = the trie tree containing the search terms
// k = the number of top search terms to find
// OUTPUT
// top_K = the list of top k search terms
top_K <- an empty array
tempList <- an empty array
TraverseTrie(tempList, trieTree.root, "")
sort tempList nonincreasingly by frequency
for i <- 1 to k:
    if length(tempList) >= i:
        append tempList[i] to top_K
return top_K
The root property of the trie object should point to the root node of the trie.
The TraverseTrie function recursively traverses the tree, building up the prefix of each word as it goes and adding any complete words it encounters to tempList along with their frequencies.
Time complexity depends on the size of the keys, the number of keys and the number of operations.
The average time complexity for insertion and search operations is O(n), where n is the length of the key being inserted or searched for.
In the worst case, time complexity can be O(nm), where m is the number of keys in the trie tree.
Space complexity is O(nm), where m is the number of keys in the trie tree and n is the length of the longest key.
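To make this concrete, here's a minimal Python sketch of such a trie; the class and field names (TrieNode, is_word, and so on) are our own naming of the properties assumed above, and the inserted terms are made up:
class TrieNode:
    def __init__(self, value=""):
        self.value = value        # the character this node represents
        self.children = {}        # maps a character to the corresponding child node
        self.frequency = 0        # how many times the full term was inserted
        self.is_word = False      # marks the end of a complete term

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term):
        node = self.root
        for char in term:
            node = node.children.setdefault(char, TrieNode(char))
        node.is_word = True
        node.frequency += 1

    def _traverse(self, node, prefix, temp_list):
        # Collect (term, frequency) pairs for every complete word below this node
        if node.is_word:
            temp_list.append((prefix, node.frequency))
        for child in node.children.values():
            self._traverse(child, prefix + child.value, temp_list)

    def top_k(self, k):
        temp_list = []
        self._traverse(self.root, "", temp_list)
        temp_list.sort(key=lambda pair: pair[1], reverse=True)
        return temp_list[:k]

# Example usage with made-up terms
trie = Trie()
for term in ["cat", "car", "cat", "dog", "cat", "car"]:
    trie.insert(term)
print(trie.top_k(2))  # [('cat', 3), ('car', 2)]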
3.3. Heap
A heap is a tree-like data structure in which each node’s value is greater or smaller than its children’s, depending on whether it’s a max-heap or a min-heap.
This allows for maintaining the maximum or minimum value element as the root of the structure.
The algorithm first initializes the heap with the first k elements from the dataset.
For each subsequent element, it compares the element’s value with the minimum or maximum value of the heap.
If the new element is larger or smaller, the algorithm replaces the minimum or maximum value, and the heap is reorganized:
algorithm FindTopKFromHeap(heap, k):
// INPUT
// heap = a heap data structure containing search terms
// k = the number of top terms required
// OUTPUT
// top_K = list of top k terms
top_K <- an empty array
for i <- 1 to k:
    if length(heap) > 0:
        item <- heap.pop()
        append item.term to top_K
return top_K
Here, heap.pop() removes and returns the highest-priority item.
Each heap insertion or removal costs O(log n), where n is the number of elements in the heap, so extracting the top k items costs O(k log n).
Space complexity is O(n).
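Here's a minimal Python sketch of the size-k min-heap approach described above, built on the standard heapq module; the helper name and the (term, frequency) pairs are made up:
import heapq

def top_k_with_heap(term_counts, k=10):
    # term_counts: iterable of (term, frequency) pairs, e.g. taken from a hashtable
    min_heap = []  # holds the k most frequent terms seen so far, smallest on top
    for term, freq in term_counts:
        if len(min_heap) < k:
            heapq.heappush(min_heap, (freq, term))
        elif freq > min_heap[0][0]:
            # The new term beats the current minimum: replace it and restore heap order
            heapq.heapreplace(min_heap, (freq, term))
    # Return the retained terms from most to least frequent
    return sorted(min_heap, reverse=True)

# Example usage with made-up counts
counts = [("java", 30), ("python", 45), ("go", 12), ("rust", 27), ("kotlin", 8)]
print(top_k_with_heap(counts, k=3))  # [(45, 'python'), (30, 'java'), (27, 'rust')]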
4. Data Streams
The ever-increasing volume of data generated every day creates challenges in processing and analyzing it effectively.
Traditional algorithms often can't handle large datasets that don't fit into memory. In this context, data stream algorithms have emerged as a promising approach to handling continuous, high-speed data streams.
So let’s explore data stream algorithms and their applications.
These are approximation algorithms, so their results may be inaccurate when configured with low precision.
Their accuracy is governed by a trade-off between memory usage and precision, which makes them practical tools in data stream processing.
4.1. Space-Saving
The Space-Saving Algorithm ranks and tracks the most frequent search terms in real-time.
This algorithm uses a dictionary to keep track of the frequency counts of the terms encountered so far. When a new element is received, the algorithm checks if the element is already present in the dictionary.
If the term is already present, its frequency counter is incremented.
Otherwise, if the dictionary is full, the algorithm replaces the element with the minimum frequency counter with the new element and updates its counter:
algorithm FindTopKTerms(terms, k):
// INPUT
// terms = a list of search terms
// k = the number of top terms to find
// OUTPUT
// top_K = the top k terms with the highest frequency
frequencyCounts <- an empty dictionary
for term in terms:
    if term in frequencyCounts:
        frequencyCounts[term] <- frequencyCounts[term] + 1
    else if length(frequencyCounts) < k:
        frequencyCounts[term] <- 1
    else:
        minTerm <- getTermWithMinFrequency(frequencyCounts)
        frequencyCounts[term] <- frequencyCounts[minTerm] + 1
        frequencyCounts.pop(minTerm)
top_K <- sortTermsByFrequency(frequencyCounts)[1:k]
return top_K
When a new term arrives and the dictionary already holds k entries, the algorithm identifies the term with the smallest count using the getTermWithMinFrequency() function and replaces it with the new term, setting the new term's count to one more than the count of the removed term.
Time complexity is O(1) for each item insertion and O(k) for finding the top-k items, where k is the number of items we want to find.
Space complexity is O(k) since it maintains a data structure with at most k items.
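Here's a minimal Python sketch of the Space-Saving logic described above; the function name and stream contents are made up, and the reported counts are overestimates by design:
def space_saving_top_k(terms, k=10):
    # counts holds at most k candidate terms and their (over)estimated frequencies
    counts = {}
    for term in terms:
        if term in counts:
            counts[term] += 1
        elif len(counts) < k:
            counts[term] = 1
        else:
            # Evict the candidate with the smallest count and take over its count + 1
            min_term = min(counts, key=counts.get)
            counts[term] = counts.pop(min_term) + 1
    return sorted(counts.items(), key=lambda entry: entry[1], reverse=True)[:k]

# Example usage with a made-up stream
stream = ["java", "python", "java", "go", "java", "python", "rust", "java"]
print(space_saving_top_k(stream, k=2))
The linear scan in min() is the simplest choice here; the full algorithm keeps its counters organized so that the minimum can be found in constant time.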
4.2. Count-Min Sketch
Count-Min Sketch is an algorithm for processing data streams in real time. It approximates the frequency of items in a stream by maintaining a fixed-size table of counters:
algorithm TopKSearchCountMinSketch(terms, k, width, depth):
// INPUT
// terms = a list of search terms
// k = the number of top terms to find
// width = the width of the sketch table
// depth = the depth of the sketch table (the number of hash functions)
// OUTPUT
// top_k = the list of top k terms
frequencyCounts <- createSketchTable(width, depth)
for term in terms:
    for i <- 0 to depth - 1:
        hashVal <- hash(term, i)
        frequencyCounts[i][hashVal] <- frequencyCounts[i][hashVal] + 1
top_k <- an empty array
for term in the distinct terms of terms:
    minCount <- infinity
    for i <- 0 to depth - 1:
        hashVal <- hash(term, i)
        count <- frequencyCounts[i][hashVal]
        minCount <- min(minCount, count)
    if length(top_k) < k or minCount > the count of the last element of top_k:
        append (minCount, term) to top_k
        sort top_k nonincreasingly
        top_k <- top_k[1:k]
return the terms stored in top_k
This algorithm creates a sketch table with a designated width and depth and proceeds to iterate over each term to increment their corresponding frequency counts in the table.
Afterwards, it identifies the top k terms with the highest frequency counts by looping over each term and determining the minimum count across all hash functions. The algorithm stores the top k terms in a list and updates it if a term with a higher frequency count is found.
To map each term to a position in every row of the sketch table, the algorithm uses a hash function parameterized by the row index, so that each row hashes the term differently and collisions are unlikely to repeat across rows.
Time complexity is O(1) per update and query when the number of hash functions is fixed, since each operation touches one counter per row.
Space complexity is O(d*w) where d is the number of hash functions and w is the width of the hash table.
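The following Python sketch illustrates the idea; the salted-MD5 hash family, the class name, and the sample stream are illustrative choices rather than the canonical construction:
import hashlib
import heapq

class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, term, row):
        # Salting the hash with the row number gives each row its own hash function
        digest = hashlib.md5(f"{row}:{term}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, term):
        for row in range(self.depth):
            self.table[row][self._index(term, row)] += 1

    def estimate(self, term):
        # The true count never exceeds the minimum counter across the rows
        return min(self.table[row][self._index(term, row)] for row in range(self.depth))

def top_k_with_cms(terms, k=10, width=1000, depth=5):
    cms = CountMinSketch(width, depth)
    for term in terms:
        cms.add(term)
    # Query the sketch once per distinct term and keep the k largest estimates
    return heapq.nlargest(k, ((cms.estimate(term), term) for term in set(terms)))

# Example usage with a made-up stream
stream = ["java"] * 5 + ["python"] * 3 + ["go"] * 2 + ["rust"]
print(top_k_with_cms(stream, k=2))  # typically [(5, 'java'), (3, 'python')]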
4.3. HyperLogLog
HyperLogLog estimates the number of unique items in a stream without storing all the items in memory. Instead, it uses a hash function to map each item to a specific bucket and maintains a value for each bucket.
In fact, the basic idea is to use the HyperLogLog algorithm to estimate the frequency of each term in the stream and then use these estimates to find the top-k search terms.
To accomplish this, we create a sketch with several buckets. Each time a term appears in the stream, we hash it, use part of the hash value to select a bucket, and keep in that bucket the maximum rank (number of leading zeros plus one) observed so far; combining the bucket values yields the estimate.
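For reference, here's a simplified HyperLogLog sketch in Python that estimates the number of distinct search terms in a stream; the SHA-1-based 64-bit hash is an illustrative choice, and the large-range correction is omitted:
import hashlib
import math

class SimpleHyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p               # number of buckets
        self.buckets = [0] * self.m   # max rank (leading zeros + 1) seen per bucket

    def _hash(self, item):
        # Derive a 64-bit hash from SHA-1 (illustrative, not the canonical choice)
        return int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)

    def add(self, item):
        x = self._hash(item)
        bucket = x & (self.m - 1)                     # low p bits pick the bucket
        rest = x >> self.p                            # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.buckets[bucket] = max(self.buckets[bucket], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # standard bias-correction constant
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.buckets)
        zeros = self.buckets.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw

# Example usage: estimate how many distinct terms the stream contains
hll = SimpleHyperLogLog()
for term in ["java", "python", "java", "go", "python", "java"]:
    hll.add(term)
print(hll.count())  # an estimate of the number of distinct search terms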
4.4. T-Digest
T-Digest is a data structure that partitions the observed data into a bounded number of clusters and uses their centroids to represent the distribution compactly.
The basic idea is to use T-Digest to maintain a data histogram and then extract the k most frequent items from the histogram.
So we can first create a T-Digest of the data stream. Then, we can iterate through the centroids of the T-Digest and use the centroid values as frequency counts for each item. We can then extract the k most frequent items from this list of frequency counts.
5. Conclusion
In this article, we have reviewed algorithms and data structures for finding the top 10 most searched terms.
We've seen that the choice of algorithm depends primarily on the dataset type and, secondarily, on the resources available to store and process it.