
6. Evaluation in Information Retrieval

8.1 Information Retrieval System Evaluation

  • Gold Standard:
    • Also known as Ground Truth.
    • Documents are given a binary relevance score (0 or 1) corresponding to whether they are relevant or not to the query.
  • Relevance:
    • Relevance is assessed relative to an information need, not a query.
    • A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.

8.2 Standard Test Collections

  • Cranfield Collection:
    • It is a collection of 1400 documents from the aerodynamics field and a set of 225 queries.
    • It is considered too small for modern IR systems, but it is a good starting point for experimentation.
    • Collected in the late 1950s.
  • Text REtrieval Conference (TREC):
    • Developed by the National Institute of Standards and Technology (NIST) in the US.
    • It contains 1.9 million documents (mostly news articles) and 450 topics (information needs).
    • Developed in the 1990s.
  • GOV2:
    • Also developed by NIST.
    • It contains 25 million documents (mostly web pages).
    • It is among the largest collections readily available for research purposes, though still far smaller than commercial web search indexes.
    • Developed in the 2000s.
  • NTCIR:
    • A series of evaluation workshops run in Japan, building test collections of similar sizes to the TREC collections.
    • Focuses on East Asian languages and cross-language information retrieval.
  • CLEF:
    • The Cross Language Evaluation Forum.
    • Focuses on European languages and cross-language information retrieval.
  • Reuters:
    • Reuters-21578 consists of 21578 news articles.
    • RCV1 consists of 800,000 news articles.
    • These are good collections for text classification research due to their rich category annotations.
  • NewsGroups:
    • It consists of about 1,000 articles for each of its 20 different newsgroups (categories).
    • A total of roughly 20,000 articles (somewhat fewer after duplicates are removed).

8.3 Evaluation of Unranked Retrieval Sets

                Relevant              Non-Relevant
Retrieved       True Positive (tp)    False Positive (fp)
Not Retrieved   False Negative (fn)   True Negative (tn)
  • Precision:
    • It is the fraction of retrieved documents that are relevant to the query.
    • It is a measure of exactness.
    • It is a number between 0 and 1.
    • It is calculated as follows:
      • Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
      • P = tp / (tp + fp)
  • Recall:
    • It is the fraction of relevant documents that are retrieved.
    • It is a measure of completeness.
    • It is a number between 0 and 1.
    • It is calculated as follows:
      • Recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
      • R = tp / (tp + fn)
  • Accuracy:
    • It is the fraction of documents that are correctly classified over the total number of documents.
    • It is calculated as follows:
      • Accuracy = (tp + tn) / (tp + tn + fp + fn) = (tp + tn) / (total number of documents)
    • It is a common measure for evaluating machine-learning classifiers.
    • It is not useful for measuring IR systems, as:
      • The data is extremely skewed towards non-relevant documents (typically more than 99.9% of the documents are non-relevant), so a system that simply returns nothing can achieve very high accuracy.
  • In web search: precision is the priority; users want the results on the first page to be relevant, and recall over the whole web matters less.
  • In professional search (e.g., legal or intelligence analysts): recall is the priority; the searcher needs to find everything relevant and will tolerate lower precision.
  • Searching local files: recall is the priority; the user wants every matching file and can sift out the noise.
  • F-Measure:
    • It is the weighted harmonic mean of precision and recall (a short code sketch computing precision, recall, and F follows this list).
    • It is calculated as follows:
      • F = (B^2 + 1) * (P * R) / (B^2 * P + R)
      • Where B is a constant parameter B^2 = (1-a)/a; a is between 0 and 1.
    • The default balanced F measure (F1) equally weights precision and recall (a = 0.5) and thus B = 1.
      • F1 = 2 * (P * R) / (P + R)
    • B:
      • B=1: F1 => Recall and Precision are equally important.
      • B<1: Precision is more important.
      • B>1: Recall is more important.
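
A minimal sketch (Python; the function and document IDs are illustrative, not from the source) of how precision, recall, and the F-measure can be computed from an unranked set of retrieved documents and the gold-standard set of relevant documents:

```python
def set_based_measures(retrieved, relevant, beta=1.0):
    """Precision, recall, and F_beta for unranked retrieval sets.

    `retrieved` and `relevant` are sets of document IDs (illustrative names).
    """
    tp = len(retrieved & relevant)   # relevant documents that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved

    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0

    # F = (B^2 + 1) * P * R / (B^2 * P + R); beta = 1 gives the balanced F1.
    b2 = beta ** 2
    f = (b2 + 1) * precision * recall / (b2 * precision + recall) if (precision + recall) else 0.0
    return precision, recall, f


# 3 of the 4 retrieved documents are relevant; 6 documents are relevant in total.
print(set_based_measures({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5", "d6"}))
# precision = 3/4 = 0.75, recall = 3/6 = 0.5, F1 = 2 * 0.75 * 0.5 / 1.25 = 0.6
```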

8.4 Evaluation of ranked retrieval results

  • Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents.
  • Precision-recall curves:
    • The precision-recall curve is a graph in which the x-axis is recall and the y-axis is precision.
    • It has a distinctive saw-tooth shape.
    • If the (K + 1)th retrieved document is:
      • non-relevant, then recall stays the same as for the top K documents, but precision drops.
      • relevant, then both precision and recall increase.
  • MAP (Mean Average Precision):
    • For a single query, Average Precision (AP) is the average of the precision values obtained at the rank of each relevant document that is retrieved; relevant documents that are never retrieved contribute a precision of 0.
    • MAP is the mean of the AP values over all queries in the test set.
    • It is calculated as follows:
      • AP = (P1 + P2 + … + Pn) / n
      • Where Pi is the precision at the rank of the i-th retrieved relevant document and n is the total number of relevant documents for the query.
    • It is a number between 0 and 1.
    • It is a good measure for comparing systems over a whole set of queries; AP values for individual queries vary widely, so it is less suited to comparing performance across queries (see the sketch after this list).
  • Precision at K:
    • It is the precision computed over the top K retrieved documents (e.g., K = 10 for the first page of results).
    • It is calculated as follows:
      • P@K = (Number of relevant documents retrieved in the top K) / K
    • It is a number between 0 and 1.
    • It reflects what a user sees for a single query and does not require an estimate of the full set of relevant documents.
    • It does not average well across queries, because the total number of relevant documents for a query strongly influences precision at K.
  • R-Precision:
    • It is the precision measured over the top R retrieved documents, where R is the total number of relevant documents for the query.
    • It is calculated as follows:
      • R-Precision = (Number of relevant documents retrieved in the top R) / R
      • Where R is the size of the set of relevant documents.
    • It is a number between 0 and 1.
    • At rank R, precision and recall are equal (both r/R when r of the top R documents are relevant), so R-precision coincides with the break-even point of the precision-recall curve.
    • Because the cutoff adapts to the size of each query's relevant set, it averages across queries more sensibly than precision at a fixed K.
  • Break-even point:
    • It is the point on the precision-recall curve where precision and recall are equal.
    • Other single summary points of the curve include the point with maximal F-measure and the precision at a cutoff of interest to a particular application (precision at K).
  • ROC curve:
    • It stands for Receiver Operating Characteristic curve.
    • It is a graph in which the x-axis is the false positive rate (FPR, 1-specificity) and the y-axis is the true positive rate (TPR, recall).
  • Sensitivity:
    • It is the same as the true positive rate (TPR).
    • Sensitivity = Recall = tp / (tp + fn)
  • Specificity:
    • It is the same as the true negative rate (TNR).
    • Specificity = tn / (tn + fp)
  • Cumulative gain measures:
    • Discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG) are designed for non-binary (graded) notions of relevance.
    • DCG discounts the gain of a document by the logarithm of its rank; NDCG divides by the DCG of the ideal ranking, so a perfect ranking scores 1 (a companion NDCG sketch also follows this list).
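
To make the ranked measures above concrete, here is a small Python sketch (names and the example ranking are illustrative) of precision at K, R-precision, and average precision for one query; MAP is then just the mean of the average-precision values over all queries in the test set.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant))

def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each retrieved relevant document.

    Relevant documents that never appear in the ranking contribute a precision of 0.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / len(relevant)

ranking, relevant = ["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5"}
print(precision_at_k(ranking, relevant, 3))   # 2/3
print(r_precision(ranking, relevant))         # 2/3 (R = 3)
print(average_precision(ranking, relevant))   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
# MAP = mean of average_precision over all queries in the test collection.
```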
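
A companion sketch for NDCG with graded relevance. The log2 rank discount is the common textbook formulation; as a simplification, the ideal DCG here is computed from the same list of gains, which assumes every judged document appears in the ranking.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of graded relevance values, in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG of the ranking divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Graded relevance (0-3 scale) of the top four results; the ideal order would be [3, 2, 1, 0].
print(ndcg([3, 2, 0, 1]))   # ≈ 0.985
```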

8.5 Assessing relevance

  • Using random combinations of query terms as information needs is not a good idea, as it does not reflect the actual usage of the system.
  • Pooling:
    • Relevance is assessed over a subset of the collection formed from the top K documents returned by a number of different IR systems.
    • Documents outside the pool are typically treated as non-relevant.
  • Kappa Statistic:
    • It is a measure of agreement between two annotators (judges); a worked sketch follows this list.
    • It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.
    • It is calculated as follows:
      • Kappa = (P(A) - P(E)) / (1 - P(E))
      • Where P(A) is the observed agreement and P(E) is the expected agreement.
    • It is a number between -1 and 1.
    • Kappa =:
      • 1: Perfect agreement, that is, the two judges always agree.
      • 0: Agreement equivalent to the chance agreement (random agreement).
      • <0: the two judges agree less than would be expected by chance.
      • >0.8: Good agreement.
      • 0.67-0.8: Fair agreement.
      • <0.67: Poor agreement or data is dubious.
  • The relevance of one document is treated as independent of the relevance of other documents in the collection.
  • This assumption is built into most retrieval systems – documents are scored against queries, not against each other – as well as being assumed in the evaluation methods.
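
A small worked sketch of the kappa computation for two judges making binary relevance judgments (Python; function and argument names are illustrative). Chance agreement P(E) is estimated here from the pooled marginals of the two judges, as in the referenced chapter; Cohen's kappa with per-judge marginals is a common alternative.

```python
def kappa(both_rel, j1_only, j2_only, both_nonrel):
    """Kappa for two judges' binary relevance judgments on the same documents."""
    n = both_rel + j1_only + j2_only + both_nonrel
    p_agree = (both_rel + both_nonrel) / n                    # P(A): observed agreement
    p_rel = (2 * both_rel + j1_only + j2_only) / (2 * n)      # pooled P(relevant)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2                  # P(E): chance agreement
    return (p_agree - p_chance) / (1 - p_chance)

# Worked example from the referenced chapter: 300 documents judged relevant by both
# judges, 20 by judge 1 only, 10 by judge 2 only, and 70 by neither.
print(round(kappa(300, 20, 10, 70), 3))   # 0.776 -> fair agreement
```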

8.6 A broader perspective: System quality and user utility

  • Formal evaluation measures are at some distance from our ultimate interest in measures of human utility: how satisfied is each user with the results the system gives for each information need that they pose?
  • A/B testing:
    • It is a method of comparing two versions of a webpage or app against each other to determine which one performs better.
    • Two versions (A and B) of an IR system that differ in a single aspect (e.g., the ranking function) are deployed; a small fraction (1-10%) of the traffic is diverted to the new system (B), and its results are compared with those of the old system (A).
    • The diverted traffic is selected at random (a sketch of one way to implement the split follows this list).
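
A minimal sketch of how the random traffic split might be implemented. Hashing a stable user identifier into buckets is a common approach that keeps each user consistently on one variant; the specific scheme and the 5% default below are assumptions for illustration, not something prescribed by the source.

```python
import hashlib

def assign_variant(user_id, b_fraction=0.05):
    """Deterministically route roughly `b_fraction` of users to variant B.

    Hash-based bucketing and the 5% default are illustrative assumptions; the
    notes above only say that 1-10% of traffic is randomly sent to system B.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "B" if bucket < b_fraction * 10_000 else "A"

# The same user always lands in the same bucket, so they see a consistent system.
print(assign_variant("user-42"))
```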

8.7 Results snippets

  • Snippet:
    • It is a short summary of the document that is displayed in the search results.
    • It allows the user to quickly assess the relevance of the document, without having to open it.
    • Snippets can be static or dynamic.
    • Static snippets:
      • They are the same for every query, typically the first few sentences (or metadata) of the document.
      • They are not very effective, because they often fail to show why the document matches the query.
    • Dynamic snippets:
      • They are query-dependent: they show the sentences (or text windows) of the document that contain the query terms.
      • They are more effective, because the user sees the query terms in context (a small sketch follows this list).
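
A rough sketch of dynamic snippet generation in the spirit described above: score each sentence by how many query terms it contains and show the best ones in document order. The naive sentence splitting and substring matching are simplifications for illustration; real engines typically work from cached copies of the documents.

```python
import re

def dynamic_snippet(document, query, max_sentences=2):
    """Pick the sentences containing the most query terms, shown in document order."""
    terms = {t.lower() for t in query.split()}
    sentences = re.split(r"(?<=[.!?])\s+", document)
    # Score = number of distinct query terms appearing in the sentence (substring match).
    scored = [(sum(t in s.lower() for t in terms), i, s) for i, s in enumerate(sentences)]
    best = sorted(scored, key=lambda x: (-x[0], x[1]))[:max_sentences]
    return " ... ".join(s for _, i, s in sorted(best, key=lambda x: x[1]))

doc = "Precision measures exactness. Recall measures completeness. The F measure combines both."
print(dynamic_snippet(doc, "precision recall"))
# -> "Precision measures exactness. ... Recall measures completeness."
```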

References

  • Manning, C. D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge University Press. Chapter 8: Evaluation in information retrieval. Retrieved from http://nlp.stanford.edu/IR-book/pdf/08eval.pdf