
4. Scoring, Term Weighting, and the Vector Space Model

Introduction [1]


Scoring, Term Weighting, and the Vector Space Model [2]

6.1 Parametric and zone indexes

  • Parametric and zone indexes:
    • Allow the index to retrieve documents by metadata such as language, author, date, etc.
    • Support simple scoring of the retrieved documents.
  • The metadata of a document consists of fields and zones.
    • Fields have a limited set of possible values, e.g. language, author, date of creation.
    • Zone values are free text, e.g. title, abstract.
  • There is one parametric index per field/zone of metadata.
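As a minimal sketch of the idea above (the dictionary layout and the toy documents are assumptions for illustration), one index per field/zone can map a field value or zone term to the set of matching docIDs:

```python
from collections import defaultdict

# Hypothetical toy collection; field values are exact, zones are free text.
docs = {
    1: {"language": "en", "title": "vector space model"},
    2: {"language": "de", "title": "boolean retrieval"},
}

# One index per field/zone: maps (name, value-or-term) -> set of docIDs.
field_index = defaultdict(set)  # fields: limited possible values
zone_index = defaultdict(set)   # zones: free text, indexed term by term

for doc_id, meta in docs.items():
    field_index[("language", meta["language"])].add(doc_id)
    for term in meta["title"].split():
        zone_index[("title", term)].add(doc_id)

# Retrieve English documents whose title zone contains "vector".
hits = field_index[("language", "en")] & zone_index[("title", "vector")]
```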

  • 6.1.1 Weighted zone scoring:

    • Weighted zone scoring: each zone i is assigned a weight g(i) (the weights sum to 1), and a document's score is sum(g(i) * s(i)), where s(i) is 1 if the query matches zone i and 0 otherwise.
    • Weighted zone scoring is sometimes referred to as ranked Boolean retrieval.
    • Accumulators: an array of scores, one per document, filled in as the postings lists are traversed.
  • 6.1.2 Learning weights:
    • The zone weights g(i) can be learned from hand-labeled training examples (machine-learned relevance).
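The weighted zone scoring described in 6.1.1 can be sketched as follows; the zone weights and document contents are illustrative assumptions, with a Boolean match indicator per zone:

```python
# Illustrative zone weights g(i); in the textbook formulation they sum to 1.
weights = {"title": 0.6, "body": 0.4}

# Toy documents, each with the two zones above (contents are made up).
docs = {
    1: {"title": "vector space model", "body": "documents as vectors"},
    2: {"title": "boolean retrieval", "body": "vector space scoring"},
}

def weighted_zone_score(query_term, doc):
    # Score = sum over zones i of g(i) * s(i), where s(i) is the
    # Boolean indicator that the query term occurs in zone i.
    return sum(g for zone, g in weights.items()
               if query_term in doc[zone].split())

# Accumulators: one running score per document.
scores = {doc_id: weighted_zone_score("vector", d) for doc_id, d in docs.items()}
```

Document 1 matches only in the title zone and document 2 only in the body zone, so their scores differ exactly by the zone weights.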

6.2 Term frequency and weighting

Symbol     Meaning
N          total number of documents in the collection
df(t)      document frequency of term t
tf(t, d)   term frequency of term t in document d
idf(t)     inverse document frequency of term t
cf(t)      collection frequency of term t
  • Bag of words:
    • A document is represented as a bag of words.
    • The order of the words is not important.
    • The number of times a word appears in the document is important.
  • Collection frequency: The number of occurrences of a term in the collection.
  • Document frequency: The number of documents in the collection that contain the term.
  • 6.2.1 Inverse document frequency:
    • Computed as idf(t) = log(N/df(t)) where N is the number of documents in the collection, and df(t) is the document frequency of term t.
    • The idf of rare terms is high.
  • 6.2.2 tf-idf weighting:
    • Computed as tf-idf(t, d) = tf(t, d) * idf(t).
    • Highest when t occurs many times within a small number of documents.
    • Lower when t occurs fewer times in a document, or occurs in many documents.
    • Lowest when t occurs in virtually all documents.
  • Each document is a vector with one component per term.
  • Overlap Score measure:
    • Computed as score(q, d) = sum of tf-idf(t, d) over all terms t in the query q.
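A toy illustration of idf, tf-idf, and the overlap score measure; the collection and term counts are invented, and log base 10 is assumed:

```python
import math

# Toy collection: term frequencies tf(t, d) per document (made-up data).
docs = {
    "d1": {"car": 3, "insurance": 1},
    "d2": {"car": 1, "best": 2},
    "d3": {"best": 1, "insurance": 2},
}
N = len(docs)  # total number of documents in the collection

def df(term):
    # Document frequency: number of documents containing the term.
    return sum(1 for d in docs.values() if term in d)

def idf(term):
    # idf(t) = log(N / df(t)); log base 10 assumed here.
    return math.log10(N / df(term))

def tf_idf(term, doc_id):
    # tf-idf(t, d) = tf(t, d) * idf(t)
    return docs[doc_id].get(term, 0) * idf(term)

def overlap_score(query_terms, doc_id):
    # score(q, d) = sum of tf-idf(t, d) over all terms t in q.
    return sum(tf_idf(t, doc_id) for t in query_terms)

score = overlap_score(["car", "insurance"], "d1")
```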

6.3 The vector space model for scoring

  • Vector space model:
    • Each document is a vector, thus, the collection is a set of document vectors.
    • Each query is a vector.
    • The score of a document is the cosine of the angle between the document vector and the query vector.
  • 6.3.1 Dot products:
    • Used to compute the similarity between two vectors (two documents, or document and query).
    • Computed as sim(d1, d2) = ( V(d1) . V(d2) ) / ( |V(d1)| |V(d2)| ); where:
      • V(d) is the vector representation of document d.
      • . indicates the Dot product between two vectors.
      • |V(d)| is the Euclidean length of vector V(d).
    • Dot product:
      • Computed as V(d1) . V(d2) = sum(V(d1)[i] * V(d2)[i]) for all i.
      • V(d1)[i] is the ith component of vector V(d1).
    • Euclidean length:
      • Computed as |V(d)| = sqrt(sum(V(d)[i]^2) for all i).
    • For length-normalized (unit) vectors, the dot product alone equals the cosine of the angle between the two vectors.
    • Term document matrix:
      • Each column is a document vector.
      • Each row represents a term vector.
      • Thus, the matrix joins document vectors and term vectors.
  • 6.3.3 Computing vector scores:
    • We store the following with each postings entry:
      • docID: The document ID.
      • termID: The term ID.
      • tf-idf: The tf-idf weight of the term in the document.
      • tf: The term frequency of the term in the document.
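The cosine similarity computation in 6.3.1 can be sketched with sparse dictionary vectors; the tf-idf weights below are invented for illustration:

```python
import math

def dot(u, v):
    # V(d1) . V(d2) = sum of componentwise products; iterating over u's
    # terms suffices, since terms missing from either vector contribute 0.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def norm(u):
    # Euclidean length |V(d)| = sqrt(sum of squared components).
    return math.sqrt(sum(w * w for w in u.values()))

def cosine_sim(u, v):
    # sim(d1, d2) = (V(d1) . V(d2)) / (|V(d1)| |V(d2)|)
    return dot(u, v) / (norm(u) * norm(v))

# Invented tf-idf vectors for a query and two documents.
q = {"car": 1.0, "insurance": 1.0}
vectors = {
    "d1": {"car": 0.9, "insurance": 0.3, "best": 0.1},
    "d2": {"best": 0.8, "auto": 0.6},
}

# Rank documents by cosine similarity to the query.
ranked = sorted(vectors, key=lambda d: cosine_sim(q, vectors[d]), reverse=True)
```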

6.4 Variant tf-idf functions

  • 6.4.1 Sublinear tf scaling:
    • Uses the log of the term frequency instead of the term frequency.
  • 6.4.2 Maximum tf normalization
    • Normalizes the term frequency by the maximum term frequency of any term in the document.
    • Computed as ntf(t, d) = a + (1 - a) * tf(t, d) / tf_max(d), where tf_max(d) is the maximum of tf(t', d) over all terms t' in d, and a is a smoothing term (e.g. 0.5).
  • 6.4.3 Document and query weighting schemes:
    • SMART notation describes a weighting scheme as two letter triples, ddd.qqq, one for the document and one for the query, specifying the tf, df, and normalization components (e.g. lnc.ltc).
  • 6.4.4 Pivoted normalized document length:
    • Pivots the length-normalization factor around the average document length, countering the tendency of cosine normalization to favor short documents.
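A short sketch of the two variant tf functions above; the smoothing term a = 0.5 matches these notes (implementations vary), and log base 10 is assumed:

```python
import math

def wf(tf):
    # Sublinear tf scaling: wf(t, d) = 1 + log(tf) if tf > 0, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

def ntf(tf, tf_max, a=0.5):
    # Maximum tf normalization: ntf = a + (1 - a) * tf / tf_max,
    # where a is a smoothing term (0.5 here, matching these notes).
    return a + (1 - a) * tf / tf_max

w = wf(100)   # grows logarithmically rather than linearly with tf
n = ntf(2, 4)
```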

References



  1. UoPeople (2020) CS3308: Information Retrieval. Unit 4: Introduction & Video lectures. https://my.uopeople.edu/course/view.php?id=7253 

  2. Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/05comp.pdf. Chapter 6: Scoring, Term Weighting, and the Vector Space Model.

  3. Oresoft LWC. (2011). WDM 57: Ranked Retrieval Model [YouTube Video]. In YouTube. https://www.youtube.com/watch?v=CTwqDj5gtLo&list=PL0174E49C0E0DD5C8&index=57