JA4. Cosine Similarity between documents¶
Statement¶
At the end of this assignment, you will be able to compute the cosine similarity between documents and use it to recommend similar documents.
Assignment instructions
Assume that you have designed a search engine and the following three documents exist in the corpus (note: in a real-world situation there would be millions of documents, each containing anywhere from hundreds to millions of words):
- Document 1: Earth is round.
- Document 2: Moon is round.
- Document 3: Day is nice.
It was found programmatically that a user always refers to Document 1. There is a need to recommend one document from the corpus to the user, so your search engine needs to find documents similar to Document 1. To accomplish the task, you need to carry out pre-processing (hint: “is” is a stop word), create a vector for each document, and compute the similarity of each document's vector with the vector of Document 1.
Based on the highest similarity among the vectors, recommend the respective document to the user.
Compute the vector for each of the documents and explain the vectors. Describe in detail the steps that lead you to a valid similarity result [Hint: Use the algorithm specified in Figure 6.14 of Introduction to Information Retrieval (stanford.edu)]. For each calculation and equation used, show the necessary references.
Solution¶
Introduction¶
To compute the cosine similarity between two documents (or a document and a query), the Vector Space Model is used. The Vector Space Model represents each document as a vector, and the cosine similarity between two vectors is used to determine how similar they are.
In our case, the vector of each document is represented by M components, where each component maps to one term in the corpus. The value of each component is the TF-IDF of the term in the document, which helps determine how important the term is in the document (Manning et al., 2009).
Calculating the TF-IDF (Term Frequency - Inverse Document Frequency) of each term in each document requires pre-processing the documents when the index is built. The steps, each detailed in its own section below (with a compact end-to-end code sketch right after this list), are:
- Process documents and generate terms.
- Compute the term frequency (TF) and document frequency (DF) for each term.
- Compute the inverse document frequency of each term (IDF).
- Compute the TF-IDF of each term in each document.
- Compute the cosine similarity between the query and each document.
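Before working through the steps by hand, here is a compact end-to-end sketch of the same pipeline. Using scikit-learn is our assumption (the assignment does not mandate a library); note that its default smoothed IDF and L2 normalization produce slightly different weights than the hand calculation below, although the ranking of documents comes out the same.

```python
# A compact sketch of the whole pipeline using scikit-learn (an assumed
# dependency; the hand calculation below uses the raw formulas instead).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Earth is round.", "Moon is round.", "Day is nice."]

# stop_words="english" removes "is"; the vectorizer also lowercases and
# strips punctuation by default.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Similarity of every document to Document 1 (row 0).
print(cosine_similarity(tfidf[0], tfidf))
# roughly [[1.0, 0.37, 0.0]]; the 0.37 differs from the hand-computed 0.12
# because scikit-learn smooths the IDF, but the ranking is identical.
```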
Process documents and generate terms¶
The processing of documents includes the following steps:
- Remove stop words.
- Stemming/Lemmatization.
- Lowercasing.
- Remove punctuation.
- Remove numbers.
- Remove whitespace.
Here are the documents after processing:
# | Doc ID | Document | Processed Document |
---|---|---|---|
1 | d1 | Earth is round | earth round |
2 | d2 | Moon is round | moon round |
3 | d3 | Day is nice | day nice |
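The processed documents above can be reproduced with a short Python sketch. The stop-word list here is a tiny hand-rolled stand-in (a real system would use a full list such as NLTK's), and stemming/lemmatization is omitted because these toy documents do not need it.

```python
import string

# Assumption: a tiny illustrative stop-word list; real systems use a full one.
STOP_WORDS = {"is", "a", "an", "the"}

def preprocess(text: str) -> list[str]:
    # Lowercase, strip punctuation, split on whitespace, drop stop words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]

docs = {"d1": "Earth is round.", "d2": "Moon is round.", "d3": "Day is nice."}
processed = {doc_id: preprocess(text) for doc_id, text in docs.items()}
# {'d1': ['earth', 'round'], 'd2': ['moon', 'round'], 'd3': ['day', 'nice']}
```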
Term and Document Frequency (Term-Document Matrix)¶
Term frequency (TF) is the number of times a term appears in a document. Document frequency (DF) is the number of documents in the corpus that contain the term.
The table below shows the TF for each term in each document; the last row shows the DF for each term:
Doc ID | day | earth | moon | nice | round |
---|---|---|---|---|---|
d1 | 0 | 1 | 0 | 0 | 1 |
d2 | 0 | 0 | 1 | 0 | 1 |
d3 | 1 | 0 | 0 | 1 | 0 |
Document Frequency (DF) | 1 | 1 | 1 | 1 | 2 |
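Continuing the sketch, TF and DF follow directly from the processed tokens:

```python
from collections import Counter

# Term frequency: raw count of each term per document.
tf = {doc_id: Counter(tokens) for doc_id, tokens in processed.items()}

# Alphabetical vocabulary, matching the column order of the table above.
vocabulary = sorted({term for tokens in processed.values() for term in tokens})

# Document frequency: the number of documents containing each term.
df = {term: sum(term in tokens for tokens in processed.values())
      for term in vocabulary}
# {'day': 1, 'earth': 1, 'moon': 1, 'nice': 1, 'round': 2}
```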
Inverse Document Frequency (IDF)¶
The IDF of a term is computed as (Manning et al., 2009):

\[\text{IDF}(t) = \log \frac{N}{\text{DF}(t)}\]

where \(N\) is the number of documents in the corpus and \(\text{DF}(t)\) is the number of documents in the corpus that contain term \(t\). The natural logarithm is used throughout, which is why \(\log \frac{3}{1} \approx 1.099\) below. In our corpus, \(N = 3\).
The table below shows the IDF for each term in the corpus:
Term | Equation | IDF |
---|---|---|
day | \(\log \frac{3}{1}\) | 1.099 |
earth | \(\log \frac{3}{1}\) | 1.099 |
moon | \(\log \frac{3}{1}\) | 1.099 |
nice | \(\log \frac{3}{1}\) | 1.099 |
round | \(\log \frac{3}{2}\) | 0.405 |
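In code, using the natural logarithm so the values match the table:

```python
import math

N = len(processed)  # 3 documents in the corpus
idf = {term: math.log(N / df[term]) for term in vocabulary}
# (rounded) {'day': 1.099, 'earth': 1.099, 'moon': 1.099,
#            'nice': 1.099, 'round': 0.405}
```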
TF-IDF¶
The TF-IDF of a term in a document is computed as (Manning et al., 2009):

\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]

where \(\text{TF}(t, d)\) is the number of times term \(t\) appears in document \(d\). The table below shows the product for each term in each document:
Doc ID | day | earth | moon | nice | round |
---|---|---|---|---|---|
d1 | 0 * 1.099 | 1 * 1.099 | 0 * 1.099 | 0 * 1.099 | 1 * 0.405 |
d2 | 0 * 1.099 | 0 * 1.099 | 1 * 1.099 | 0 * 1.099 | 1 * 0.405 |
d3 | 1 * 1.099 | 0 * 1.099 | 0 * 1.099 | 1 * 1.099 | 0 * 0.405 |
The table below shows the TF-IDF for each term in each document:
Doc ID | day | earth | moon | nice | round |
---|---|---|---|---|---|
d1 | 0 | 1.099 | 0 | 0 | 0.405 |
d2 | 0 | 0 | 1.099 | 0 | 0.405 |
d3 | 1.099 | 0 | 0 | 1.099 | 0 |
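The same weights in code; since each document's vector (next section) is simply its row of TF-IDF weights over the alphabetical vocabulary, this also builds the vectors:

```python
# A Counter returns 0 for absent terms, so missing terms get weight 0.
vectors = {
    doc_id: [tf[doc_id][term] * idf[term] for term in vocabulary]
    for doc_id in processed
}
# (rounded) vectors['d1'] == [0, 1.099, 0, 0, 0.405]
```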
Document Vectors¶
As stated earlier, the vectors of the documents are represented by M components, where each component maps to one term in the corpus. The value of each component is the TF-IDF of the term in the document.
In our case, the corpus contains 5 distinct terms, so each vector consists of 5 components, ordered alphabetically: day, earth, moon, nice, round.
The table below shows the vector for each document; the last row shows how each vector was constructed:
Doc ID | Vector |
---|---|
d1 | [0, 1.099, 0, 0, 0.405] |
d2 | [0, 0, 1.099, 0, 0.405] |
d3 | [1.099, 0, 0, 1.099, 0] |
Doc ID | [tf-idf(day), tf-idf(earth), tf-idf(moon), tf-idf(nice), tf-idf(round)] |
Cosine Similarity¶
The cosine similarity between two vectors is computed as (Manning et al., 2009):

\[\text{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lvert\vec{q}\rvert \, \lvert\vec{d}\rvert} = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} d_i^2}}\]

where \(\vec{d}\) is the vector for the document and \(\vec{q}\) is the vector for the query.

Let's consider the query to be Document 1, so the query vector is:

\[\vec{q} = [0, 1.099, 0, 0, 0.405]\]
Let’s prepare what we need to calculate the cosine similarity:
Doc ID | Query Vector | Doc Vector | \(\sum_{i=1}^{n} d_i q_i\) | \(\sqrt{\sum_{i=1}^{n} d_i^2}\) | \(\sqrt{\sum_{i=1}^{n} q_i^2}\) |
---|---|---|---|---|---|
d1 | [0, 1.099, 0, 0, 0.405] | [0, 1.099, 0, 0, 0.405] | 1.371826 | 1.17125 | 1.17125 |
d2 | [0, 1.099, 0, 0, 0.405] | [0, 0, 1.099, 0, 0.405] | 0.164025 | 1.17125 | 1.17125 |
d3 | [0, 1.099, 0, 0, 0.405] | [1.099, 0, 0, 1.099, 0] | 0 | 1.55422 | 1.17125 |
The cosine similarity between the query and each document is:
Doc ID | Cosine Similarity |
---|---|
d1 | 1 |
d2 | 0.12 |
d3 | 0 |
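A minimal sketch of the computation, reusing the vectors built above; the recommendation step excludes the query document itself, since recommending the document the user already reads is of no value:

```python
import math

def cosine_sim(u: list[float], v: list[float]) -> float:
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = vectors["d1"]  # the query equals Document 1
scores = {doc_id: cosine_sim(query, vec) for doc_id, vec in vectors.items()}
# (rounded) {'d1': 1.0, 'd2': 0.12, 'd3': 0.0}

# Recommend the most similar document other than the query itself.
recommendation = max((d for d in scores if d != "d1"), key=scores.get)
# 'd2'
```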
Conclusion¶
In our discussion, we assumed that the user query equals Document 1, that is, the user is looking for “Earth is round”. Based on the cosine similarity table we computed, Document 1 has the highest similarity with the query, which is expected since it is identical to the query. Document 2 (similarity 0.12) is more relevant than Document 3 (similarity 0) because it shares the term “round” with the query. So if we were to recommend a document to the user, we would recommend Document 1, and Document 2 if it passes a certain agreed-upon threshold; Document 3 is certainly not relevant to the user query.
References¶
- Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge University Press. Chapter 6: Scoring, Term Weighting, and the Vector Space Model. https://nlp.stanford.edu/IR-book/pdf/05comp.pdf
- Oresoft LWC. (2011). WDM 57: Ranked Retrieval Model [Video]. YouTube. https://www.youtube.com/watch?v=CTwqDj5gtLo&list=PL0174E49C0E0DD5C8&index=57