
4. Scoring, Term Weighting, and the Vector Space Model

Introduction [1]


Scoring, Term Weighting, and the Vector Space Model [2]

6.1 Parametric and zone indexes

  • Parametric and zone indexes:
    • Allow the index to retrieve documents by metadata such as language, author, date, etc.
    • Support simple scoring of the retrieved documents.
  • The metadata of a document consists of fields and zones.
    • Fields have a limited set of possible values, e.g. language, author, date of creation.
    • Zone values are free text, e.g. title, abstract.
  • There is one parametric index per field/zone of metadata.
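As a minimal sketch of the idea above (the dictionary layout and the toy documents are assumptions for illustration), one index per field/zone can map a field value or zone term to the set of matching docIDs:

```python
from collections import defaultdict

# Hypothetical toy collection; field values are exact, zones are free text.
docs = {
    1: {"language": "en", "title": "vector space model"},
    2: {"language": "de", "title": "boolean retrieval"},
}

# One index per field/zone: maps (name, value-or-term) -> set of docIDs.
field_index = defaultdict(set)  # fields: limited possible values
zone_index = defaultdict(set)   # zones: free text, indexed term by term

for doc_id, meta in docs.items():
    field_index[("language", meta["language"])].add(doc_id)
    for term in meta["title"].split():
        zone_index[("title", term)].add(doc_id)

# Retrieve English documents whose title zone contains "vector".
hits = field_index[("language", "en")] & zone_index[("title", "vector")]
```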

  • 6.1.1 Weighted zone scoring:

    • Weighted zone scoring: each zone i is assigned a weight g(i) (the weights sum to 1), and a document's score is sum(g(i) * s(i)), where s(i) is 1 if the query matches zone i and 0 otherwise.
    • Weighted zone scoring is sometimes referred to as ranked Boolean retrieval.
    • Accumulators: an array of scores, one per document, filled in as the postings lists are traversed.
  • 6.1.2 Learning weights:
    • The zone weights g(i) can be learned from hand-labeled training examples (machine-learned relevance).
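The weighted zone scoring described in 6.1.1 can be sketched as follows; the zone weights and document contents are illustrative assumptions, with a Boolean match indicator per zone:

```python
# Illustrative zone weights g(i); in the textbook formulation they sum to 1.
weights = {"title": 0.6, "body": 0.4}

# Toy documents, each with the two zones above (contents are made up).
docs = {
    1: {"title": "vector space model", "body": "documents as vectors"},
    2: {"title": "boolean retrieval", "body": "vector space scoring"},
}

def weighted_zone_score(query_term, doc):
    # Score = sum over zones i of g(i) * s(i), where s(i) is the
    # Boolean indicator that the query term occurs in zone i.
    return sum(g for zone, g in weights.items()
               if query_term in doc[zone].split())

# Accumulators: one running score per document.
scores = {doc_id: weighted_zone_score("vector", d) for doc_id, d in docs.items()}
```

Document 1 matches only in the title zone and document 2 only in the body zone, so their scores differ exactly by the zone weights.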

6.2 Term frequency and weighting

Symbol     Meaning
N          total number of documents in the collection
df(t)      document frequency of term t
tf(t, d)   term frequency of term t in document d
idf(t)     inverse document frequency of term t
cf(t)      collection frequency of term t
  • Bag of words:
    • A document is represented as a bag of words.
    • The order of the words is not important.
    • The number of times a word appears in the document is important.
  • Collection frequency: The number of occurrences of a term in the collection.
  • Document frequency: The number of documents in the collection that contain the term.
  • 6.2.1 Inverse document frequency:
    • Computed as idf(t) = log(N/df(t)) where N is the number of documents in the collection, and df(t) is the document frequency of term t.
    • The idf of rare terms is high.
  • 6.2.2 tf-idf weighting:
    • Computed as tf-idf(t, d) = tf(t, d) * idf(t).
    • Highest when t occurs many times within a small number of documents.
    • Lower when t occurs fewer times in a document, or occurs in many documents.
    • Lowest when t occurs in virtually all documents.
  • Each document is a vector with one component per term.
  • Overlap Score measure:
    • Computed as score(q, d) = sum of tf-idf(t, d) over all terms t in the query q.
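A toy illustration of idf, tf-idf, and the overlap score measure; the collection and term counts are invented, and log base 10 is assumed:

```python
import math

# Toy collection: term frequencies tf(t, d) per document (made-up data).
docs = {
    "d1": {"car": 3, "insurance": 1},
    "d2": {"car": 1, "best": 2},
    "d3": {"best": 1, "insurance": 2},
}
N = len(docs)  # total number of documents in the collection

def df(term):
    # Document frequency: number of documents containing the term.
    return sum(1 for d in docs.values() if term in d)

def idf(term):
    # idf(t) = log(N / df(t)); log base 10 assumed here.
    return math.log10(N / df(term))

def tf_idf(term, doc_id):
    # tf-idf(t, d) = tf(t, d) * idf(t)
    return docs[doc_id].get(term, 0) * idf(term)

def overlap_score(query_terms, doc_id):
    # score(q, d) = sum of tf-idf(t, d) over all terms t in q.
    return sum(tf_idf(t, doc_id) for t in query_terms)

score = overlap_score(["car", "insurance"], "d1")
```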

6.3 The vector space model for scoring

  • Vector space model:
    • Each document is a vector, thus, the collection is a set of document vectors.
    • Each query is a vector.
    • The score of a document is the cosine of the angle between the document vector and the query vector.
  • 6.3.1 Dot products:
    • Used to compute the similarity between two vectors (two documents, or document and query).
    • Computed as sim(d1, d2) = ( V(d1) . V(d2) ) / ( |V(d1)| |V(d2)| ); where:
      • V(d) is the vector representation of document d.
      • . indicates the Dot product between two vectors.
      • |V(d)| is the Euclidean length of vector V(d).
    • Dot product:
      • Computed as V(d1) . V(d2) = sum(V(d1)[i] * V(d2)[i]) for all i.
      • V(d1)[i] is the ith component of vector V(d1).
    • Euclidean length:
      • Computed as |V(d)| = sqrt(sum(V(d)[i]^2) for all i).
    • For length-normalized (unit) vectors, the dot product alone equals the cosine of the angle between the two vectors.
    • Term document matrix:
      • Each column is a document vector.
      • Each row represents a term vector.
      • Thus, the matrix joins document vectors and term vectors.
  • 6.3.3 Computing vector scores:
    • We store the following with each postings entry:
      • docID: The document ID.
      • termID: The term ID.
      • tf-idf: The tf-idf weight of the term in the document.
      • tf: The term frequency of the term in the document.
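The cosine similarity computation in 6.3.1 can be sketched with sparse dictionary vectors; the tf-idf weights below are invented for illustration:

```python
import math

def dot(u, v):
    # V(d1) . V(d2) = sum of componentwise products; iterating over u's
    # terms suffices, since terms missing from either vector contribute 0.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def norm(u):
    # Euclidean length |V(d)| = sqrt(sum of squared components).
    return math.sqrt(sum(w * w for w in u.values()))

def cosine_sim(u, v):
    # sim(d1, d2) = (V(d1) . V(d2)) / (|V(d1)| |V(d2)|)
    return dot(u, v) / (norm(u) * norm(v))

# Invented tf-idf vectors for a query and two documents.
q = {"car": 1.0, "insurance": 1.0}
vectors = {
    "d1": {"car": 0.9, "insurance": 0.3, "best": 0.1},
    "d2": {"best": 0.8, "auto": 0.6},
}

# Rank documents by cosine similarity to the query.
ranked = sorted(vectors, key=lambda d: cosine_sim(q, vectors[d]), reverse=True)
```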

6.4 Variant tf-idf functions

  • 6.4.1 Sublinear tf scaling:
    • Uses the log of the term frequency instead of the term frequency.
  • 6.4.2 Maximum tf normalization
    • Normalizes the term frequency by the maximum term frequency of any term in the document.
    • Computed as ntf(t, d) = a + (1 - a) * tf(t, d) / tf_max(d), where tf_max(d) is the maximum of tf(t', d) over all terms t' in d, and a is a smoothing term (e.g. 0.5).
  • 6.4.3 Document and query weighting schemes:
    • SMART notation describes a weighting scheme as two letter triples, ddd.qqq, one for the document and one for the query, specifying the tf, df, and normalization components (e.g. lnc.ltc).
  • 6.4.4 Pivoted normalized document length:
    • Pivots the length-normalization factor around the average document length, countering the tendency of cosine normalization to favor short documents.
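A short sketch of the two variant tf functions above; the smoothing term a = 0.5 matches these notes (implementations vary), and log base 10 is assumed:

```python
import math

def wf(tf):
    # Sublinear tf scaling: wf(t, d) = 1 + log(tf) if tf > 0, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

def ntf(tf, tf_max, a=0.5):
    # Maximum tf normalization: ntf = a + (1 - a) * tf / tf_max,
    # where a is a smoothing term (0.5 here, matching these notes).
    return a + (1 - a) * tf / tf_max

w = wf(100)   # grows logarithmically rather than linearly with tf
n = ntf(2, 4)
```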

References



  1. UoPeople (2020) CS3308: Information Retrieval. Unit 4: Introduction & Video lectures. https://my.uopeople.edu/course/view.php?id=7253 

  2. Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/05comp.pdf. Chapter 6: Scoring, Term Weighting, and the Vector Space Model.

  3. Oresoft LWC. (2011). WDM 57: Ranked Retrieval Model [YouTube Video]. In YouTube. https://www.youtube.com/watch?v=CTwqDj5gtLo&list=PL0174E49C0E0DD5C8&index=57