WA6. Evaluation in Information Retrieval
Statement
Unit six looks at how to evaluate the effectiveness of an information retrieval (IR) system. Precision, recall, accuracy, and the F-measure are discussed as metrics for measuring the effectiveness of the results an IR system retrieves. In chapter 8 of the text, we learn that a number of standard document collections (corpora) are used for this purpose, together with well-known queries (recall that a query contains the terms used to search the collection). The IR system under evaluation first indexes the corpus, and the queries are then used to test the results the system returns. What makes this process useful is that for these queries there are known facts, such as the number of documents in the collection that SHOULD be relevant.
These measures of effectiveness are calculated based on such known information and the results returned from a query submitted to an IR system.
For example, consider the corpus used in our course. It contains 570 documents drawn from the Communications of the ACM journal (the Association for Computing Machinery), and it can be downloaded from the resources section of unit 2. Suppose we know that this collection contains documents relevant to a query for the terms ‘Simpson’ and ‘algorithm’.
Assume that the IR system we have developed returns 8 relevant documents and 10 documents that are not relevant. Using this information and the formulas for Precision, Recall, F-measure, and Accuracy, calculate each of these measures for the example above. When you have determined each metric, post a response that includes:
- The Precision, Recall, F-Measure, and Accuracy effectiveness metrics, calculated from the figures provided above.
- A discussion of which metric provides the most valid measure of the effectiveness of the IR system, and why.
Keep in mind that Precision and Recall are used together to measure effectiveness; the F-Measure provides a single number that balances the two; and Accuracy measures how correctly documents in the collection are classified.
Solution
Let’s first prepare all the parameters that we need to calculate the metrics.
|               | Relevant | Not Relevant |
|---------------|----------|--------------|
| Retrieved     | TP = 8   | FP = 10      |
| Not Retrieved | FN = 0   | TN = 552     |
- The total number of documents in the collection is 570.
- In the discussion below, we assume that the query has returned all relevant documents, i.e., no relevant document was left unretrieved.
- TP is the number of documents that are relevant and retrieved: 8 (from the problem statement).
- FP is the number of documents that are not relevant but retrieved: 10 (from the problem statement).
- FN is the number of documents that are relevant but not retrieved: 0 (we assumed the query returned all relevant documents).
- TN is the number of documents that are not relevant and not retrieved: 552 (the collection holds 570 - 8 = 562 non-relevant documents, of which 10 were retrieved as false positives, so TN = 562 - 10 = 552).
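The bookkeeping above can be sketched in a few lines of Python. The counts 570, 8, and 10 come from the problem statement; the zero false negatives is our stated assumption:

```python
# Counts from the problem statement (FN = 0 is our assumption
# that the query returned every relevant document).
total_docs = 570  # size of the collection
tp = 8            # relevant and retrieved
fp = 10           # not relevant but retrieved
fn = 0            # relevant but not retrieved (assumed)

# Everything else is a true negative: not relevant and not retrieved.
tn = total_docs - tp - fp - fn
print(tn)  # 552
```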
Second, let’s calculate the metrics.
- Precision = TP / (TP + FP) = 8 / (8 + 10) = 8/18 ≈ 0.44 = 44%
- Recall = TP / (TP + FN) = 8 / (8 + 0) = 1 = 100%
- F-Measure = 2 * Precision * Recall / (Precision + Recall) = (2 * (8/18) * 1) / ((8/18) + 1) = 8/13 ≈ 0.62 = 62%
- Accuracy = (TP + TN) / (TP + TN + FP + FN) = (8 + 552) / (8 + 552 + 10 + 0) = 560/570 ≈ 0.98 = 98%
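As a quick check, the four formulas can be evaluated directly with a minimal Python sketch using the confusion-matrix counts derived above:

```python
tp, fp, fn, tn = 8, 10, 0, 552  # confusion-matrix counts from above

precision = tp / (tp + fp)                                 # 8/18
recall    = tp / (tp + fn)                                 # 8/8 = 1.0
f_measure = 2 * precision * recall / (precision + recall)  # 8/13
accuracy  = (tp + tn) / (tp + tn + fp + fn)                # 560/570

print(f"Precision: {precision:.0%}")  # 44%
print(f"Recall:    {recall:.0%}")     # 100%
print(f"F-measure: {f_measure:.0%}")  # 62%
print(f"Accuracy:  {accuracy:.0%}")   # 98%
```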
Third, let’s define each metric and discuss what it means, both in general and in the context of our problem (Manning et al., 2009):
- Precision: the fraction of retrieved documents that are relevant to the query; it measures exactness, i.e., how many of the retrieved documents are actually relevant. In our problem, the precision is 44%, which means that 44% of the documents retrieved for this query are relevant and the remaining 56% are not.
- Recall: the fraction of relevant documents that are retrieved; it measures completeness, i.e., how many of the relevant documents were found and how many are still missing. In our problem, the recall is 100%, which means the query retrieved every relevant document in the collection; note that recall says nothing about how many irrelevant documents were retrieved along with them.
- F-Measure: the harmonic mean of precision and recall; it provides a single score that balances the two. In our problem, the F-measure is about 62%; because the harmonic mean is dominated by the smaller of its inputs, the low precision (44%) pulls the score well below the perfect recall (100%).
- Accuracy: the fraction of documents that are correctly classified (relevant documents retrieved, non-relevant documents left out); it measures overall correctness. In our problem, the accuracy is 98%, which means 98% of the 570 documents in the collection were classified correctly.
In my opinion, the most valid measure of the effectiveness of the IR system is Accuracy, because it reflects how well documents are gathered, parsed, and classified. Good classification usually leads to good results, which increases user satisfaction, the main purpose of any information retrieval system; it also tends to yield good precision and recall, and therefore a good F-measure.
To conclude, it is hard to single out one best metric, because each of them measures a different aspect of the IR system. We chose Accuracy here, but internal or external factors may make another metric preferable in a given setting.
References
- Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval (Online ed.). Cambridge, UK: Cambridge University Press. Chapter 8: Evaluation in information retrieval. Retrieved from http://nlp.stanford.edu/IR-book/pdf/08eval.pdf