Using random combinations of query terms as information needs is a bad idea: such queries do not reflect how the system is actually used.
Pooling:
Relevance is assessed over a subset of the collection that is formed from the top K documents retrieved by different IR systems.
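A minimal sketch of how such a pool might be formed, assuming each system contributes a ranked list of document IDs (the system names, document IDs, and value of K below are illustrative):

```python
def build_pool(ranked_runs, k):
    """Union of the top-k document IDs from each system's ranked run;
    only documents in this pool are judged for relevance."""
    pool = set()
    for run in ranked_runs.values():
        pool.update(run[:k])
    return pool

# Hypothetical runs from two systems for one information need
runs = {
    "system_A": ["d3", "d7", "d1", "d9"],
    "system_B": ["d7", "d2", "d3", "d5"],
}
print(sorted(build_pool(runs, k=3)))  # ['d1', 'd2', 'd3', 'd7']
```

Documents outside the pool are typically treated as nonrelevant.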
Kappa Statistic:
It is a measure of agreement between two annotators (judges).
It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.
It is calculated as follows (a computational sketch appears after the list below):
Kappa = (P(A) - P(E)) / (1 - P(E))
where P(A) is the proportion of judgments on which the two judges agree (observed agreement) and P(E) is the proportion of agreement expected by chance.
It is a number between -1 and 1.
Interpreting kappa values:
1: Perfect agreement; the two judges always agree.
0: Agreement equivalent to chance agreement (random agreement).
<0: The two judges agree less than would be expected by chance.
As rules of thumb:
>0.8: Good agreement.
0.67-0.8: Fair agreement.
<0.67: Poor agreement; the judgment data is a dubious basis for evaluation.
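A short sketch of the calculation for two judges making binary relevance judgments, with P(E) estimated from the pooled marginals of both judges, as in the book's worked example (the judgment vectors below are made up):

```python
def kappa(judge1, judge2):
    """Kappa for two judges' binary relevance judgments (1 = relevant,
    0 = nonrelevant), with chance agreement from pooled marginals."""
    n = len(judge1)
    # P(A): proportion of documents on which the judges agree
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # Marginal probability of a "relevant" judgment, pooled over both judges
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_nonrel = 1 - p_rel
    # P(E): chance agreement = both say relevant or both say nonrelevant
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 0, 1, 0, 0, 1, 0]
j2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(kappa(j1, j2), 3))  # P(A) = 0.75, P(E) = 0.5, so kappa = 0.5
```

For these vectors the judges agree better than chance, but kappa falls well below the 0.67 threshold for fair agreement.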
The relevance of one document is treated as independent of the relevance of other documents in the collection.
This assumption is built into most retrieval systems – documents are scored against queries, not against each other – and is likewise assumed by the evaluation measures.
8.6 A broader perspective: System quality and user utility
Formal evaluation measures are at some distance from our ultimate interest in measures of human utility: how satisfied is each user with the results the system gives for each information need that they pose?
A/B testing:
It is a method of comparing two versions of a system against each other to determine which one performs better.
Deploy two versions (A and B) of an IR system that differ in precisely one aspect (e.g., the ranking function), divert a small fraction (1-10%) of the traffic to the new system (B), and then compare its results with those of the old system (A) using an automatic measure such as clickthrough rate.
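A hedged sketch of the traffic split, assuming users are assigned to a bucket deterministically from their ID and that clickthrough rate is the comparison measure (the 5% fraction, counters, and simulated click behavior are all illustrative):

```python
import random

TREATMENT_FRACTION = 0.05  # divert 5% of traffic to the new system (B)

def assign_bucket(user_id):
    """Deterministically map a user ID to 'A' or 'B' so the same user
    always sees the same system; a real deployment would use a stable hash."""
    return "B" if user_id % 100 < TREATMENT_FRACTION * 100 else "A"

clicks = {"A": 0, "B": 0}
impressions = {"A": 0, "B": 0}

def log_result(user_id, clicked):
    bucket = assign_bucket(user_id)
    impressions[bucket] += 1
    clicks[bucket] += int(clicked)

# Simulated traffic: in this toy setup, users on B click slightly more often
for uid in range(10_000):
    rate = 0.32 if assign_bucket(uid) == "B" else 0.30
    log_result(uid, clicked=random.random() < rate)

for b in ("A", "B"):
    print(f"{b}: clickthrough rate = {clicks[b] / impressions[b]:.3f}")
```

Because only one aspect differs between A and B, any consistent difference in the measured rates can be attributed to that change.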
Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval (Online ed.). Cambridge: Cambridge University Press. Chapter 8: Evaluation in information retrieval. Retrieved from http://nlp.stanford.edu/IR-book/pdf/08eval.pdf