1. Overview
An inverted index is a data structure used to store and organize information for efficient search and retrieval.
In this tutorial, we’ll take a closer look at the inverted index, how it works, its advantages and limitations, and its applications in various domains.
2. How an Inverted Index Works
An inverted index works by mapping each unique word or term in a set of documents to the documents in which it occurs. This is in contrast to a forward index, which maps each document to the words it contains. This is visualized in the next figure:
Forward Index
Inverted Index
Docs
Content
Keywords
Docs
1
Content1
Keyword1
1,2
2
Content2
Keyword2
2
3
Content3
Keyword3
2,3
An inverted index is based on three main concepts: terms, documents, and index. Terms are the unique words or phrases that are found in the documents. Documents are the individual pieces of content being indexed, such as web pages. Index contains the mappings of a term to a document and may include additional information, such as the term’s location within the document.
To create an inverted index, the text of each document is first preprocessed by removing stop words, applying stemming, and using other techniques to normalize the text. Then, the text is tokenized, meaning that it is split into individual terms. The terms are then added to the index, with each term pointing to the documents in which it appears. This is done by creating an index for each term-document pair, which contains information such as the document ID, the term frequency (i.e., how often the term appears in the document), and the position of the term within the document.
In the next figure, the typical inverted index is given:
Documents
Terms
Index
DocId
Docs
TermId
Term
Terms
Docs
1
To be or not to be
1
be
1
1:2:[2,6], 2:1:[2], 3:1:[3]
2
To be right
2
left
2
3:1:[4]
3
Not to be left
3
not
3
1:1:[4], 3:1:[1]
4
or
4
1:1:[3]
5
right
5
2:1:[3]
6
to
6
1:2:[1,5], 2:1:[1], 3:1:[2]
To create an inverted index, the text of each document (the first table on the right in the figure) is first preprocessed by removing stop words, stemming, and other techniques to normalize the text. Then, the text is tokenized, meaning that it is split into individual terms. The obtained table is a term table where terms are added in sorted order (the second table).
The terms are then added to the index, each pointing to the documents it appears in. This is depicted in the third table. In it, each term-document pair contains information such as the document ID, the term frequency (i.e., how often the term appears in the document), and the position of the term within the document.
To better present the internal structure of the index, we will focus on the first entry. The column term is with id 1, which is the token be, and the column document has a value of a list . The first element of this list is , which means that the term occurs in document 1, with frequency 2, at positions 2 and 6.
3. How Is a Search Query Executed?
When a search query is executed, the query is first tokenized, and the individual terms are looked up in the inverted index. For each term, the index returns a list of documents that contain the term, along with information about the term’s frequency and position within each document. These lists are then combined and ranked according to a relevance score, which considers factors such as term frequency, document length, and the proximity of the terms within the document.
The highest-ranked documents are then returned as search results:
Original Query
To do right
Query terms
to, do, right
Term
TermId
Docs from index
do
na
na
to
1
1:2:[2,6], 2:1:[2], 3:1:[3]
right
5
2:1:[3]
Results:
DocId=1; To be or not to be
DocId=2; To be right
DocId=3; Not to be left
4. Conclusion
In this article, we discussed an inverted index concept.
Overall, using an inverted index allows for faster and more precise search results, as it reduces the amount of data that needs to be searched and allows for more complex queries to be executed efficiently.