DA1. Biword Index¶
Statement¶
In unit one, we are introduced to the concept of the inverted index as a fundamental technology in information retrieval systems. The inverted index essentially is an index of words known as terms extracted from the document corpus that can be searched to find documents with the content that the user is looking for. Our text also introduces two extensions to the concept of the inverted index, the biword index and the positional index.
For your discussion assignment:
- Select either the biword index or positional index
- Provide a description of the index that you selected. As part of your description make sure that you describe why and how it is different than the inverted index.
- Describe both where and when the index would it be used
- Describe the advantage the index has over the inverted index
Solution¶
A Phrase Query is a query that consists of multiple terms that appear in the same order in the document, or a literal search query, that is, the occurrence of all terms in the query in a different order is not considered a match. E.g. Car Crash
is a phrase query that matches Car Crash
but not Crash Car
.
Answering phrase queries requires more than just a simple inverted index with simple postings lists.
Both biword index
and positional index
are extensions to the inverted index that allow for answering phrase queries without harming the main functionality of the inverted index to answer Boolean queries, and with good precision, recall, and response time (Manning et al., 2009, p.39).
Literal phrase queries are usually surrounded by quotation marks (“”), but a good percentage of user queries are implicitly phrase queries, that is, the quotes are omitted, but the query is a legitimate phrase query; thus, the system need to handle both explicit and implicit phrase queries.
The Biword Index is an enhanced form of the traditional inverted index, suited for handling phrase queries, and treats every two consecutive words in a document as a distinct phrase (Manning et al., 2009, p.39).
The biword index differs from the inverted index in two main issues (According to Manning et al., 2009, p.39):
- The inverted index considers every word in the document as a token, while the biword index considers every two consecutive words as a token; which may make sense as some words are meaningless without the other or give a different meaning when used together.
- The inverted index do not ignore any word, while the biword index ignores some words that are not useful in phrase queries (e.g.
and
,of the
, etc) with special focus on nouns, proper nouns and function words.
So the biword index is used mostly for phrase queries like in search engines, and when the query appears to be normal boolean query, but it is implicitly a phrase query (e.g. when the user forgets to put the quotation marks).
The biword index has an advantage over the inverted index in handling phrase queries, where an inverted index will have to get all postings for each term in the query and intersect them to get to the final result, while biword index will pick up those postings supe quickly.
As we saw, the biword index is an enhanced form of the traditional inverted index, and it have some advantages over the inverted index, but it may have some false positives, that is, the response will contain irrelevant documents; but all modern search engines rely on inverted index in some form or another (Anderton, n.d, p.5).
References¶
- Manning, C.D., Raghaven, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. Available at https://nlp.stanford.edu/IR-book/pdf/02voc.pdf Chapter 2: Terms and Postings.
- Anderton J. (n.d). Indexing. Northeastern University. https://www.khoury.northeastern.edu/home/vip/teach/IRcourse/2_indexing_ngrams/slides/indexing_1.pdf