Using random combinations of query terms as information needs is a bad idea: such queries do not reflect how the system is actually used.
Pooling:
Relevance is assessed over a subset of the collection that is formed from the top K documents retrieved by different IR systems.
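A minimal sketch of how such a pool might be formed, assuming each system contributes a ranked list of document IDs (the system names, document IDs, and value of K below are illustrative):

```python
def build_pool(ranked_runs, k):
    """Union of the top-k document IDs from each system's ranked run;
    only documents in this pool are judged for relevance."""
    pool = set()
    for run in ranked_runs.values():
        pool.update(run[:k])
    return pool

# Hypothetical runs from two systems for one information need
runs = {
    "system_A": ["d3", "d7", "d1", "d9"],
    "system_B": ["d7", "d2", "d3", "d5"],
}
print(sorted(build_pool(runs, k=3)))  # ['d1', 'd2', 'd3', 'd7']
```

Documents outside the pool are typically treated as nonrelevant.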
Kappa Statistic:
It is a measure of agreement between two annotators (judges).
It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.
It is calculated as follows (a computational sketch appears after the list below):
Kappa = (P(A) - P(E)) / (1 - P(E))
where P(A) is the proportion of judgments on which the two judges agree (observed agreement) and P(E) is the proportion of agreement expected by chance.
It is a number between -1 and 1.
Interpreting kappa values:
1: Perfect agreement; the two judges always agree.
0: Agreement equivalent to chance agreement (random agreement).
<0: The two judges agree less than would be expected by chance.
As rules of thumb:
>0.8: Good agreement.
0.67-0.8: Fair agreement.
<0.67: Poor agreement; the judgment data is a dubious basis for evaluation.
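A short sketch of the calculation for two judges making binary relevance judgments, with P(E) estimated from the pooled marginals of both judges, as in the book's worked example (the judgment vectors below are made up):

```python
def kappa(judge1, judge2):
    """Kappa for two judges' binary relevance judgments (1 = relevant,
    0 = nonrelevant), with chance agreement from pooled marginals."""
    n = len(judge1)
    # P(A): proportion of documents on which the judges agree
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # Marginal probability of a "relevant" judgment, pooled over both judges
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_nonrel = 1 - p_rel
    # P(E): chance agreement = both say relevant or both say nonrelevant
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 0, 1, 0, 0, 1, 0]
j2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(kappa(j1, j2), 3))  # P(A) = 0.75, P(E) = 0.5, so kappa = 0.5
```

For these vectors the judges agree better than chance, but kappa falls well below the 0.67 threshold for fair agreement.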
The relevance of one document is treated as independent of the relevance of other documents in the collection.
This assumption is built into most retrieval systems – documents are scored against queries, not against each other – and is likewise assumed by the evaluation measures.
8.6 A broader perspective: System quality and user utility
Formal evaluation measures are at some distance from our ultimate interest in measures of human utility: how satisfied is each user with the results the system gives for each information need that they pose?
A/B testing:
It is a method of comparing two versions of a system against each other to determine which one performs better.
Deploy two versions (A and B) of an IR system that differ in precisely one aspect (e.g., the ranking function), divert a small fraction (1-10%) of the traffic to the new system (B), and then compare its results with those of the old system (A) using an automatic measure such as clickthrough rate.
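A hedged sketch of the traffic split, assuming users are assigned to a bucket deterministically from their ID and that clickthrough rate is the comparison measure (the 5% fraction, counters, and simulated click behavior are all illustrative):

```python
import random

TREATMENT_FRACTION = 0.05  # divert 5% of traffic to the new system (B)

def assign_bucket(user_id):
    """Deterministically map a user ID to 'A' or 'B' so the same user
    always sees the same system; a real deployment would use a stable hash."""
    return "B" if user_id % 100 < TREATMENT_FRACTION * 100 else "A"

clicks = {"A": 0, "B": 0}
impressions = {"A": 0, "B": 0}

def log_result(user_id, clicked):
    bucket = assign_bucket(user_id)
    impressions[bucket] += 1
    clicks[bucket] += int(clicked)

# Simulated traffic: in this toy setup, users on B click slightly more often
for uid in range(10_000):
    rate = 0.32 if assign_bucket(uid) == "B" else 0.30
    log_result(uid, clicked=random.random() < rate)

for b in ("A", "B"):
    print(f"{b}: clickthrough rate = {clicks[b] / impressions[b]:.3f}")
```

Because only one aspect differs between A and B, any consistent difference in the measured rates can be attributed to that change.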
Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval (Online ed.). Cambridge: Cambridge University Press. Chapter 8: Evaluation in information retrieval. Retrieved from http://nlp.stanford.edu/IR-book/pdf/08eval.pdf