7. Ranking in a Complete Search System

7.1 Efficient scoring and ranking 1

  • 7.1.1 Inexact top K document retrieval:
    • Retrieve K documents whose scores are very close to those of the K best.
    • While avoiding computing scores for most of the N documents in the collection.
  • 7.1.2 Index elimination:
    • For a query with multiple terms, we only consider documents that contain at least one of the query terms; and then:
      1. Only consider query terms whose IDF exceeds a certain threshold => low-IDF terms (e.g. stop words) are effectively ignored => fewer postings lists to retrieve and process.
      2. Only consider documents that contain a sufficiently large fraction of the query terms => E.g. ignore documents that contain fewer than 3 of the terms of a 5-term query => fewer documents to score.
    • Both thresholds are highly application-dependent, and a problem may arise if the chosen thresholds leave fewer than K candidate documents, in which case we cannot return the required K documents (a sketch of both heuristics follows below).
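
The following Python sketch illustrates both index-elimination heuristics under some assumptions: `index` maps each term to the list of IDs of documents containing it, and the threshold values `min_idf` and `min_fraction` are illustrative, not values prescribed by the text.

```python
import math
from collections import Counter

# Minimal sketch of index elimination. `index` maps a term to the list of
# doc IDs containing it; `min_idf` and `min_fraction` are illustrative
# thresholds, not values prescribed by the text.
def candidate_docs(index, n_docs, query_terms, min_idf=1.0, min_fraction=0.6):
    # Heuristic 1: keep only query terms whose IDF passes the threshold
    # (low-IDF terms such as stop words contribute little to the score).
    kept = [t for t in query_terms
            if t in index and math.log10(n_docs / len(index[t])) >= min_idf]

    # Heuristic 2: keep only documents containing a large enough fraction
    # of the remaining query terms.
    counts = Counter(doc for t in kept for doc in index[t])
    needed = max(1, math.ceil(min_fraction * len(kept)))
    return {doc for doc, c in counts.items() if c >= needed}
```
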
  • 7.1.3 Champion lists:
    • Also known as fancy lists or top docs.
    • For each term, we store a list of documents that have the highest scores for that term.
    • When a query is issued, we only consider documents in the champion lists.
    • Champion lists are usually small (e.g. 1000 documents) and can be stored in memory.
    • Champion lists can be precomputed offline.
    • The length of the champion list r is a parameter that can be tuned, but it is determined while indexing and before the query is issued.
    • The number of documents to return k depends on the query itself and is determined at query time.
    • Problem: if we choose r smaller than k, the champion lists may not yield the top k documents (a sketch of champion-list retrieval follows below).
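
A minimal sketch of champion lists, assuming `postings` maps each term to a dict of per-document weights (e.g. tf-idf); the names and the choice of r are illustrative:

```python
import heapq

# Minimal sketch of champion lists. `postings` maps each term to a dict
# {doc_id: weight}, e.g. the tf-idf weight of the term in that document.
# `r` is the champion-list length fixed at indexing time.
def build_champion_lists(postings, r):
    # Offline: for each term keep only the r documents with the highest weight.
    return {term: set(heapq.nlargest(r, weights, key=weights.get))
            for term, weights in postings.items()}

def champion_candidates(champions, query_terms):
    # At query time, only documents in some query term's champion list are
    # scored; if the union holds fewer than k documents, a fallback is needed.
    docs = set()
    for t in query_terms:
        docs |= champions.get(t, set())
    return docs
```
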
  • 7.1.4 Static quality scores and ordering:
    • This improves the selection of champion lists.
    • Static quality scores are query-independent scores that measure the quality of a document.
    • Net score of a document = static quality score + dynamic score (computed at query time) = static score + cosine similarity.
    • \(\text{net-score(q,d)} = g(d) + \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\| \|\vec{q}\|}\)
      • Where \(g(d)\) is the static quality score of document \(d\) and its value is between 0 and 1.
      • The net score now is between 0 and 2 (0-1 static score + 0-1 cosine similarity).
    • We may also store a second champion list holding the next r best documents (those ranked just below the first champion list).
    • If, at query time, the first champion list yields fewer than k documents, the second champion list is used to retrieve the remaining documents (see the net-score sketch below).
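
A minimal worked sketch of the net score, assuming document and query vectors are plain dicts of term weights; all names and values are illustrative:

```python
import math

# Minimal sketch of net-score(q, d) = g(d) + cosine(q, d), where g(d) is a
# query-independent quality score in [0, 1]. Vectors are plain dicts
# {term: weight}; all names and values are illustrative.
def cosine(q_vec, d_vec):
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    norm = (math.sqrt(sum(w * w for w in q_vec.values()))
            * math.sqrt(sum(w * w for w in d_vec.values())))
    return dot / norm if norm else 0.0

def net_score(g_d, q_vec, d_vec):
    # With non-negative weights the result lies between 0 and 2.
    return g_d + cosine(q_vec, d_vec)
```
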
  • 7.1.5 Impact ordering:
    • The idea is to order the documents in each postings list by a key other than the document ID.
    • Instead, the postings of each term are sorted by decreasing tf(t,d) (the term's frequency in the document).
    • We then do not need to process a whole postings list: we stop once we have accumulated enough documents or once tf(t,d) drops below a certain threshold.
    • A complementary idea is to process the query terms in decreasing order of IDF, so the most selective terms contribute to the scores first (see the sketch below).
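
A minimal sketch of impact-ordered traversal, assuming each postings list is stored as (doc_id, tf) pairs sorted by decreasing tf; the cutoff values and the simple accumulator are illustrative:

```python
# Minimal sketch of impact-ordered traversal. Each postings list holds
# (doc_id, tf) pairs sorted by decreasing tf; we stop scanning a list once
# tf falls below `tf_cutoff` or enough candidates have been accumulated.
def impact_ordered_candidates(postings_by_tf, query_terms, idf,
                              tf_cutoff=2, max_docs=1000):
    scores = {}
    # Process query terms in decreasing IDF order, so the most selective
    # terms contribute their postings first.
    for term in sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True):
        for doc_id, tf in postings_by_tf.get(term, []):
            if tf < tf_cutoff or len(scores) >= max_docs:
                break
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf.get(term, 0.0)
    return scores
```
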
  • 7.1.6 Cluster pruning:
    • A preprocessing step that clusters documents into groups.
    • Preprocessing:
      • Pick √N documents at random from the collection and call them leaders.
      • For each remaining document, compute its nearest leader and attach it to that leader's cluster; these non-leader documents are called followers.
    • At query time, only a few clusters are considered:
      • Find the leader closest to the query, then score only that leader and its followers.
    • The scheme generalises with two parameters b1 and b2: each follower is attached to its b1 nearest leaders, and at query time the clusters of the b2 leaders closest to the query are scored.
    • Choosing the leaders at random is fast and tends to reflect the distribution of the documents, which works well in practice (see the sketch below).
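
A minimal sketch of cluster pruning with b1 = b2 = 1, assuming documents are dicts of term weights compared with cosine similarity; the function and variable names are illustrative:

```python
import math
import random

# Minimal sketch of cluster pruning with b1 = b2 = 1: √N leaders are sampled
# at random during preprocessing, every other document follows its nearest
# leader, and at query time only the closest leader's cluster is scored.
def _cos(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_clusters(docs):                       # docs: {doc_id: vector}
    leaders = random.sample(list(docs), max(1, int(math.sqrt(len(docs)))))
    clusters = {l: [l] for l in leaders}        # each leader is in its own cluster
    for d in docs:
        if d not in clusters:                   # d is a follower
            nearest = max(leaders, key=lambda l: _cos(docs[d], docs[l]))
            clusters[nearest].append(d)
    return clusters

def prune_and_rank(query_vec, docs, clusters, k):
    # Consider only the cluster of the leader closest to the query.
    leader = max(clusters, key=lambda l: _cos(query_vec, docs[l]))
    members = clusters[leader]
    return sorted(members, key=lambda d: _cos(query_vec, docs[d]), reverse=True)[:k]
```
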

7.2 Components of an information retrieval system 1

  • 7.2.1 Tiered indexes:
    • We store the index in multiple tiers, where each tier is a different index.
    • We always start from tier 1 and fall back to the next tier only if the tiers processed so far yield fewer than the required k documents.
    • E.g.: tier 1 contains only postings with tf(t,d) > 20, tier 2 contains postings with 10 < tf(t,d) < 20, and so on (see the sketch below).
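
A minimal sketch of querying a tiered index, assuming `tiers` is a list of inverted indexes ordered from highest to lowest quality; all names are illustrative:

```python
# Minimal sketch of a tiered index at query time. `tiers` is a list of
# inverted indexes (term -> set of doc IDs) ordered from the highest tier
# to the lowest; we fall through only when we still have fewer than k docs.
def tiered_candidates(tiers, query_terms, k):
    results = set()
    for tier in tiers:
        for t in query_terms:
            results |= tier.get(t, set())
        if len(results) >= k:                 # enough candidates: stop early
            break
    return results
```
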
  • 7.2.2 Query-term proximity:
    • We can use the position of the query terms in the document to improve the ranking.
    • Let w be the width, in words, of the smallest window in the document that contains all of the query terms.
    • Documents with smaller windows are ranked higher.
    • E.g.: if the query is “Ahmad student”:
      • Document 1: “Ahmad is a student at the University of Jordan.” Smallest window = 4 words (“Ahmad is a student”).
      • Document 2: “Ahmad is a web developer, he used to be a student.” Smallest window = 11 words.
      • Document 1 is ranked higher than Document 2, as its smallest window containing all query terms is narrower (a sketch of computing w follows below).
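
A minimal sketch of computing w under simple assumptions (whitespace tokenisation with basic punctuation stripping); all names are illustrative:

```python
from collections import defaultdict

# Minimal sketch of the proximity window w: the width, in words, of the
# smallest window containing every query term at least once, or None if
# some query term is missing from the document.
def smallest_window(doc_text, query_terms):
    tokens = [w.strip('.,;:!?') for w in doc_text.lower().split()]
    needed = {t.lower() for t in query_terms}
    if not needed:
        return None
    counts = defaultdict(int)
    have, best, left = 0, None, 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] += 1
            if counts[tok] == 1:
                have += 1
        while have == len(needed):            # shrink the window from the left
            width = right - left + 1
            best = width if best is None else min(best, width)
            if tokens[left] in needed:
                counts[tokens[left]] -= 1
                if counts[tokens[left]] == 0:
                    have -= 1
            left += 1
    return best

# E.g. smallest_window("Ahmad is a student at the University of Jordan.",
#                      ["Ahmad", "student"]) returns 4.
```
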
  • 7.2.3 Designing parsing and scoring functions:
    • This depends on user population, query distribution, and the collection of documents itself.
    • Query parser:
      • It is a function that translates a user query into a query that can be processed by the search engine; that is, adding/removing the necessary operators and terms.
      • Sometimes the query parser generates more than one query for a single user query; each generated query is executed and the result sets are merged (see the sketch below).
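
A minimal sketch of such a parser, assuming a hypothetical `run_query(kind, q)` call that stands in for the underlying search engine; the particular rewrite sequence is illustrative:

```python
# Minimal sketch of a query parser that turns one free-text query into a
# sequence of progressively looser queries, running the next one only when
# the previous ones returned fewer than k results. `run_query` is a
# hypothetical stand-in for the underlying search engine call.
def parse_and_search(user_query, run_query, k=10):
    terms = user_query.split()
    plans = [
        ('phrase', '"%s"' % user_query),        # exact phrase query
        ('and',    ' AND '.join(terms)),        # all terms, any order
        ('or',     ' OR '.join(terms)),         # any term (vector-space style)
    ]
    results, seen = [], set()
    for kind, q in plans:
        for doc_id in run_query(kind, q):
            if doc_id not in seen:              # merge, keeping first-seen order
                seen.add(doc_id)
                results.append(doc_id)
        if len(results) >= k:
            break
    return results[:k]
```
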
  • 7.2.4 Putting it all together:
    • We have now studied all the components necessary for a basic search system that supports free text queries as well as Boolean, zone and field queries.
    • [Figure: a complete search system combining the components above.]

7.3 Vector space scoring and query operator interaction 1

  • Discusses how vector space scoring interacts with Boolean, wildcard, and phrase queries, and how free-text retrieval can be combined with these query operators.

References


  1. Manning, C.D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval (Online ed.). Cambridge: Cambridge University Press. Chapter 7: Computing Scores in a Complete Search System. http://nlp.stanford.edu/IR-book/pdf/07system.pdf