
JA4. Cosine Similarity between documents

Statement

At the end of this assignment, you will be able to find the cosine similarity between documents and use it to recommend similar documents.

Assignment instructions

Assume that you have designed a search engine and the following three documents exist in the corpus (Note: In a real-world situation, there exist millions of documents with words ranging from hundreds to millions in each of them):

  • Document 1: Earth is round.
  • Document 2: Moon is round.
  • Document 3: Day is nice.

It was found programmatically that a user always refers to Document 1. There is a need to recommend one document from the corpus to the user. Your search engine needs to find documents similar to Document 1. To accomplish the task, you need to carry out pre-processing (hint: “is” is a stop word), create a vector for each document, and compute the similarity of each document's vector with the vector of Document 1.

Based on the highest similarity among the vectors, recommend the respective document to the user.

Compute the vector for each of the documents and explain the vectors. Describe in detail the steps that made you reach the valid similarity result [Hint: Use the algorithm specified in Figure 6.14 Introduction to Information Retrieval (stanford.edu)]. For each calculation and equation used, show the necessary references for it.

Solution

Introduction

To compute the cosine similarity between two documents (or a document and a query), the Vector Space Model is used. The Vector Space Model represents each document as a vector, and the cosine similarity between two vectors is used to determine how similar they are.

In our case, the vector of each document is represented by M components, where each component maps to one term in the corpus. The value of each component is the TF-IDF of the term in the document, which helps determine how important the term is in the document (Manning et al., 2009).

Calculating the TF-IDF (Term Frequency - Inverse Document Frequency) of each term in each document requires pre-processing the documents at the stage of preparing the index; the steps are described below.

  1. Process documents and generate terms.
  2. Compute the term frequency (TF) and document frequency (DF) for each term.
  3. Compute the inverse document frequency of each term (IDF).
  4. Compute the TF-IDF of each term in each document.
  5. Compute the cosine similarity between the query and each document.

Process documents and generate terms

The processing of documents includes the following steps:

  • Remove stop words.
  • Stemming/Lemmatization.
  • Lowercasing.
  • Remove punctuation.
  • Remove numbers.
  • Remove whitespace.

Here are the documents after processing:

No.   Doc ID   Original Document   Processed Document
1     d1       Earth is round      earth round
2     d2       Moon is round       moon round
3     d3       Day is nice         day nice
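As a sketch, the pre-processing step could be implemented as below. The stop-word list here is a small assumption for this toy corpus; real systems use much larger lists and a proper stemmer or lemmatizer.

```python
import string

# Minimal stop-word list assumed for this example; real lists are far larger.
STOP_WORDS = {"is", "a", "an", "the"}

def preprocess(text):
    """Lowercase, strip punctuation, and remove stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

docs = {
    "d1": "Earth is round.",
    "d2": "Moon is round.",
    "d3": "Day is nice.",
}
processed = {doc_id: preprocess(text) for doc_id, text in docs.items()}
# processed["d1"] == ["earth", "round"]
```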

Term and Document Frequency (term-document matrix)

Term frequency (TF) is the number of times a term appears in a document. Document frequency (DF) is the number of documents in the corpus that contain the term.

The table below shows the TF for each term in each document; the last row shows the DF for each term:

Doc ID               day   earth   moon   nice   round
d1                   0     1       0      0      1
d2                   0     0       1      0      1
d3                   1     0       0      1      0
------------------   ---   -----   ----   ----   -----
Document Frequency   1     1       1      1      2
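Both frequencies can be computed with a short sketch, reusing the processed terms from the previous step:

```python
from collections import Counter

processed = {
    "d1": ["earth", "round"],
    "d2": ["moon", "round"],
    "d3": ["day", "nice"],
}

# Term frequency: raw count of each term in each document.
tf = {doc_id: Counter(terms) for doc_id, terms in processed.items()}

# Vocabulary: all distinct terms in the corpus, in sorted order.
vocab = sorted({t for terms in processed.values() for t in terms})

# Document frequency: number of documents containing each term.
df = {t: sum(1 for terms in processed.values() if t in terms) for t in vocab}
# df == {"day": 1, "earth": 1, "moon": 1, "nice": 1, "round": 2}
```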

Inverse Document Frequency (IDF)

The IDF of a term is computed as:

\[ \text{IDF}(t) = \log \frac{N}{\text{DF}(t)} \]

Where \(N\) is the number of documents in the corpus, and \(\text{DF}(t)\) is the number of documents in the corpus that contain term \(t\). The natural logarithm is used throughout this solution.

In our corpus, \(N = 3\).

The table below shows the IDF for each term in the corpus:

Term    Equation               IDF
day     \(\log \frac{3}{1}\)   1.099
earth   \(\log \frac{3}{1}\)   1.099
moon    \(\log \frac{3}{1}\)   1.099
nice    \(\log \frac{3}{1}\)   1.099
round   \(\log \frac{3}{2}\)   0.405
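These IDF values can be reproduced with Python's natural-log `math.log`, which matches the figures above:

```python
import math

N = 3  # number of documents in the corpus
df = {"day": 1, "earth": 1, "moon": 1, "nice": 1, "round": 2}

# IDF(t) = log(N / DF(t)), using the natural logarithm.
idf = {t: math.log(N / df_t) for t, df_t in df.items()}
# round(idf["day"], 3)   == 1.099
# round(idf["round"], 3) == 0.405
```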

TF-IDF

The TF-IDF of a term in a document is computed as:

\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \]

Where \(\text{TF}(t, d)\) is the number of times term \(t\) appears in document \(d\).

Applying the formula term by term:

Doc ID   day         earth       moon        nice        round
d1       0 * 1.099   1 * 1.099   0 * 1.099   0 * 1.099   1 * 0.405
d2       0 * 1.099   0 * 1.099   1 * 1.099   0 * 1.099   1 * 0.405
d3       1 * 1.099   0 * 1.099   0 * 1.099   1 * 1.099   0 * 0.405

The table below shows the TF-IDF for each term in each document:

Doc ID   day     earth   moon    nice    round
d1       0       1.099   0       0       0.405
d2       0       0       1.099   0       0.405
d3       1.099   0       0       1.099   0
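The TF-IDF rows above can be reproduced with a small sketch; the vocabulary order and counts are taken from the earlier tables, and each document's vector is simply its row:

```python
import math

N = 3
vocab = ["day", "earth", "moon", "nice", "round"]
tf = {
    "d1": {"earth": 1, "round": 1},
    "d2": {"moon": 1, "round": 1},
    "d3": {"day": 1, "nice": 1},
}
df = {"day": 1, "earth": 1, "moon": 1, "nice": 1, "round": 2}

# TF-IDF(t, d) = TF(t, d) * IDF(t); terms absent from a document get weight 0.
tfidf = {
    doc_id: [tf[doc_id].get(t, 0) * math.log(N / df[t]) for t in vocab]
    for doc_id in tf
}
# [round(x, 3) for x in tfidf["d1"]] == [0.0, 1.099, 0.0, 0.0, 0.405]
```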

Document Vectors

As stated earlier, the vectors of the documents are represented by M components, where each component maps to one term in the corpus. The value of each component is the TF-IDF of the term in the document.

In our case, the corpus contains 5 terms; thus, our vectors consist of 5 components.

The table below shows the vector for each document; the last row shows how each vector was constructed:

Doc ID         Vector
d1             [0, 1.099, 0, 0, 0.405]
d2             [0, 0, 1.099, 0, 0.405]
d3             [1.099, 0, 0, 1.099, 0]
------         -----------------------
Construction   [tfidf(day, d), tfidf(earth, d), tfidf(moon, d), tfidf(nice, d), tfidf(round, d)]

Cosine Similarity

The cosine similarity between two vectors is computed as:

\[ \cos(\vec{d}, \vec{q}) = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\| \|\vec{q}\|} = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^2} \sqrt{\sum_{i=1}^{n} q_i^2}} \]

Where \(\vec{d}\) is the vector for the document, and \(\vec{q}\) is the vector for the query.

Let’s consider the query to be Document 1, and the query vector is:

\[ \vec{q} = [0, 1.099, 0, 0, 0.405] \]

Let’s prepare what we need to calculate the cosine similarity:

Doc ID   Query Vector \(\vec{q}\)   Doc Vector \(\vec{d}\)    \(\vec{d} \cdot \vec{q}\)   \(\|\vec{d}\|\)   \(\|\vec{q}\|\)
d1       [0, 1.099, 0, 0, 0.405]    [0, 1.099, 0, 0, 0.405]   1.371826                    1.17125           1.17125
d2       [0, 1.099, 0, 0, 0.405]    [0, 0, 1.099, 0, 0.405]   0.164025                    1.17125           1.17125
d3       [0, 1.099, 0, 0, 0.405]    [1.099, 0, 0, 1.099, 0]   0                           1.55422           1.17125

The cosine similarity between the query and each document is:

Doc ID Cosine Similarity
d1 1
d2 0.12
d3 0
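These similarity values can be checked with a short sketch of the cosine formula:

```python
import math

def cosine(d, q):
    """cos(d, q) = (d . q) / (|d| * |q|)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot / (norm_d * norm_q)

q  = [0, 1.099, 0, 0, 0.405]   # Document 1 serves as the query
d1 = [0, 1.099, 0, 0, 0.405]
d2 = [0, 0, 1.099, 0, 0.405]
d3 = [1.099, 0, 0, 1.099, 0]
# round(cosine(q, d2), 2) == 0.12; cosine(q, d3) == 0.0
```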

Conclusion

In our discussion, we assumed that the user query equals Document 1, that is, the user is looking for “Earth is round”. Based on the cosine similarity table we have computed, Document 1 has the highest similarity with the query, which is expected since it is identical to the query. Document 2 is more relevant than Document 3, as its cosine similarity is higher. So if we were to recommend a document to the user, we would recommend Document 1, followed by Document 2 (if it passes a certain agreed-upon threshold); Document 3 is certainly not relevant to the user query.

References

Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval. Cambridge University Press.