DA4. IDF and Parametric vs Zone Indexes

Statement

What is involved in determining and calculating the Inverse Document Frequency? Also, what are the differences between parametric and zone indexes?

Solution

According to Manning et al. (2009), calculating the inverse document frequency (IDF) involves two stages: first processing the entire collection to gather statistics about terms and documents, and then computing the IDF of each term from those statistics.

The process starts by converting each document to a bag of words: a representation in which the order of the words is ignored and only the number of times each word appears in the document (its term frequency) is kept.

The process then continues by calculating the document frequency of each term, which is the number of documents in the collection that contain the term at least once.
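The two steps above can be sketched in Python. This is a minimal illustration, not the book's implementation: the whitespace tokenizer, the function names `bag_of_words` and `document_frequency`, and the three sample documents are all assumptions made for the example.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count occurrences (term frequency)."""
    return Counter(text.lower().split())

def document_frequency(docs):
    """Map each term to the number of documents that contain it at least once."""
    df = Counter()
    for text in docs:
        # set(...) makes a term count once per document, however often it repeats
        df.update(set(bag_of_words(text)))
    return df

# Hypothetical three-document collection
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]
bags = [bag_of_words(d) for d in docs]
df = document_frequency(docs)
```

Here "the" occurs twice in the first document (term frequency 2), yet it adds only 1 to the document frequency for that document, so its document frequency over the collection is 2.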

The IDF of each term is then calculated by dividing the total number of documents in the collection (N) by the document frequency of the term (df(t)) and taking the logarithm of the result, as shown in the following equation:

\[ IDF(t) = \log\left(\frac{N}{df(t)}\right) \]

Because df(t) appears in the denominator, the IDF of rare terms is high and the IDF of common terms is low. In the extreme case, a term that appears in every document has df(t) = N, so its IDF is log(1) = 0.
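As a rough sketch of the formula, the snippet below computes the IDF from a table of document frequencies. The function name `inverse_document_frequency`, the sample counts, and the choice of a base-10 logarithm are assumptions for illustration. The base only rescales every weight by the same constant, so it does not change the ranking of terms.

```python
import math

def inverse_document_frequency(df, n_docs):
    """Return IDF(t) = log10(N / df(t)) for every term in the df table."""
    # Terms with df(t) = 0 never appear in the table, so there is no division by zero.
    return {term: math.log10(n_docs / count) for term, count in df.items()}

# Hypothetical document frequencies in a collection of N = 1000 documents
df = {"the": 1000, "retrieval": 10, "zebra": 1}
idf = inverse_document_frequency(df, n_docs=1000)

# Order the terms from rarest (highest IDF) to most common (lowest IDF)
ranked = sorted(idf, key=idf.get, reverse=True)
```

With these counts the rare term "zebra" receives the largest weight (log10 of 1000, which is 3), while "the", which occurs in every document, receives a weight of 0.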

The Parametric Index indexes documents by their metadata, such as language, author, and date. The metadata of a document consists of fields; each field has a limited set of possible values and a limited size. Common fields include language (one word, drawn from a finite list of languages), author (a few words), and date of creation. There is one parametric index per metadata field. It is useful for simple queries such as “find all documents in English” or “find all documents written by John Smith” (Manning et al., 2009).
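The idea of one index per field can be sketched as a dictionary of dictionaries. This is a simplified, in-memory illustration under stated assumptions: the function names `build_parametric_indexes` and `search`, the exact-match lookup, and the sample metadata are hypothetical, and a real system would likely use structures that also support range queries on fields such as dates.

```python
from collections import defaultdict

def build_parametric_indexes(docs):
    """Build one index per metadata field: field -> value -> set of doc IDs."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, metadata in docs.items():
        for field, value in metadata.items():
            indexes[field][value].add(doc_id)
    return indexes

def search(indexes, **criteria):
    """Return the doc IDs that match every field=value criterion."""
    results = None
    for field, value in criteria.items():
        matches = indexes.get(field, {}).get(value, set())
        # Intersect with the matches of the previous criteria
        results = set(matches) if results is None else results & matches
    return results if results is not None else set()

# Hypothetical metadata for three documents
docs = {
    1: {"language": "English", "author": "John Smith"},
    2: {"language": "English", "author": "Jane Doe"},
    3: {"language": "French", "author": "John Smith"},
}
indexes = build_parametric_indexes(docs)
english_docs = search(indexes, language="English")
english_by_smith = search(indexes, language="English", author="John Smith")
```

The query "find all documents in English" becomes a single lookup in the language index, and combining it with an author criterion is an intersection of two sets of document IDs.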

The Zone Index, on the other hand, indexes terms by the part of the document in which they occur: the document is divided into zones, and each zone is indexed separately. Unlike fields, zones contain free text and can hold much longer values. Common zones include the title and the abstract (Manning et al., 2009). Zones matter for scoring because they differ in importance; for example, the title of a document is more important than the footer, so a document is more likely to be relevant if a query term appears in its title.
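One way to turn zone importance into a score is weighted zone scoring, where each zone has a weight and the score of a document is the sum of the weights of the zones that contain the query term. The sketch below assumes a single-term query, whitespace tokenization, and hand-picked weights that sum to 1; the function name `weighted_zone_score`, the weights, and the sample document are hypothetical.

```python
def weighted_zone_score(query_term, doc_zones, weights):
    """Sum the weights of the zones whose text contains the query term."""
    term = query_term.lower()
    score = 0.0
    for zone, weight in weights.items():
        text = doc_zones.get(zone, "")
        # Boolean match: a zone contributes its full weight or nothing
        if term in text.lower().split():
            score += weight
    return score

# Assumed zone weights (they sum to 1); the title counts the most
weights = {"title": 0.6, "abstract": 0.3, "body": 0.1}

# Hypothetical document split into zones
doc = {
    "title": "Index compression",
    "abstract": "We study compression of postings lists",
    "body": "Postings lists dominate index size",
}

title_hit = weighted_zone_score("compression", doc, weights)
body_hit = weighted_zone_score("postings", doc, weights)
```

The term "compression" matches the title and the abstract, so it scores about 0.9, whereas "postings" matches only the abstract and the body and scores about 0.4, which reflects the higher weight given to the title.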

To conclude, the IDF is an important factor in scoring, as it is used in the calculation of more complex scoring factors such as TF-IDF, and it is usually stored alongside the postings data. The Parametric Index and the Zone Index are both important for scoring as well, since they make it possible to compute more complex scores, such as weighted zone scores whose zone weights can be set by hand or learned from training data (Manning et al., 2009).

References

Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/05comp.pdf. Chapter 6: Scoring, Term Weighting, and the Vector Space Model.