DA3. Power Laws

Statement

  • The output of the indexer that we started developing in unit 2, and continue to develop in this unit (unit 3), includes statistics such as the number of documents, the total number of terms, and the number of unique terms added to the index.
  • Heaps’ law provides a formula for estimating the number of unique terms (M) in a collection from two constants, k and b, and the number of terms or tokens (T) parsed from all documents: M = kT^b
  • In section 5.1.1 of the textbook (page 88), we are given typical values for both constants: k typically falls in a range between 10 and 100, and b ≈ 0.4 to 0.6.
  • Requirements:
    1. Using the formula for Heaps’ law, calculate the estimated size of the vocabulary (M) from the total number of terms parsed from all documents, as reported when running your indexer program. Given that both k and b are typically found through empirical analysis, assume that k will be 40 and b will be 0.50.
    2. Compare the estimate with the “total number of unique terms found and added to the index” statistic reported by your indexer program, which represents the actual size of the vocabulary in your collection.
    3. Report your findings in a posting response in the unit 3 discussion forum.
    4. If the size of the vocabulary estimated by Heaps’ law is not consistent with the vocabulary discovered by your indexer process, speculate on why this may have occurred. The discrepancy may be uncovering a flaw in your program, or the corpus you are using may be limited in vocabulary due to its subject content.
  • Discuss your findings with your peers and provide feedback to at least 3 peers on this submission.

Solution

In the previous written assignment (unit 2), we ran an indexer against a corpus of 570 documents from the Reuters collection. The indexer reported the following statistics:

  • Total number of documents: 570
  • Total number of tokens: 47201
  • Total number of unique terms: 4280

The first requirement of this assignment is to use Heaps’ law to estimate the size of the vocabulary (M) using the total number of terms parsed and the following parameters: k = 40 and b = 0.5.

\[ \begin{align} M &= kT^b \\ &= 40 \times 47201^{0.5} \\ &\approx 40 \times 217.3 \\ &\approx 8692 \end{align} \]
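For readers who want to reproduce the arithmetic, a minimal Python sketch of the Heaps’ law estimate follows. The token count 47201 and the constants k = 40 and b = 0.5 come from the figures above; the function name is just illustrative.

```python
def heaps_estimate(tokens: int, k: float = 40.0, b: float = 0.5) -> int:
    """Estimate vocabulary size M = k * T^b from total token count T."""
    return round(k * tokens ** b)

total_tokens = 47201  # reported by the indexer for the 570-document corpus
print(heaps_estimate(total_tokens))
# Prints 8690; the 8692 above comes from rounding sqrt(47201) to 217.3
# before multiplying.
```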

The second requirement of this assignment is to compare the estimate with the actual size of the vocabulary in the collection, which is summarized below (UoPeople, 2023):

  • The actual size of the vocabulary is 4280.
  • The estimated size of the vocabulary is 8692.
  • The estimated size of the vocabulary is roughly twice the actual size of the vocabulary, which is a significant difference.

The third requirement of this assignment is to report findings and try to explain the difference between the estimated size and the actual size of the vocabulary. The following are some possible explanations (Manning et al., 2009):

  • The collection may be centered on a specific topic, which may limit the vocabulary size compared with a more general collection.
  • The constants k and b are typically found through empirical analysis, which means they may not be accurate for all collections; different constants may make the estimate more accurate for this one.

There are also several aspects of the indexer that inflate the number of unique terms it reports:

  • The corpus is provided in markup (HTML) format; the indexer removes the < and > characters from the tokens, but the tag names themselves remain part of the tokens, which may increase the number of unique terms.
  • The indexer does not perform spelling correction, so several unique terms may be generated for the same word (misspellings are counted as new words).
  • The indexer does not exclude numbers, which adds around 1100 unique terms to the vocabulary (I manually examined the indexer output in the index.dat file and counted the unique terms that are numbers). Some numbers, such as page numbers, are not useful for retrieval and should be excluded from the vocabulary.
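As an illustration of the last point, a filter like the following could drop purely numeric tokens before they enter the vocabulary. This is a hypothetical post-processing step, not part of the actual indexer, and the regular expression is one possible definition of “numeric”:

```python
import re

# Matches tokens that are purely numeric, optionally with thousands
# separators or a decimal part, e.g. "1984", "1,000", "3.14".
NUMERIC = re.compile(r"^\d[\d,]*(\.\d+)?$")

def drop_numeric_tokens(tokens):
    """Return only the tokens that are not purely numeric."""
    return [t for t in tokens if not NUMERIC.match(t)]

print(drop_numeric_tokens(["page", "12", "reuters", "3.14", "1,000"]))
# → ['page', 'reuters']
```

Running such a filter before counting unique terms would remove the roughly 1100 numeric entries mentioned above and bring the observed vocabulary even further below the Heaps’ law estimate.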

Now let us verify Heaps’ law with k = 30 and b = 0.5. The estimated size of the vocabulary is:

\[ \begin{align} M &= kT^b \\ &= 30 \times 47201^{0.5} \\ &\approx 30 \times 217.3 \\ &\approx 6519 \end{align} \]

That is still higher than the actual size of the vocabulary, so let us try k = 20 and b = 0.5. The estimated size of the vocabulary is:

\[ \begin{align} M &= kT^b \\ &= 20 \times 47201^{0.5} \\ &\approx 20 \times 217.3 \\ &\approx 4346 \end{align} \]

This is very close to the actual vocabulary size of 4280. Thus, k = 20 and b = 0.5 are more suitable constants for this collection. This is an experimental approach, however, and the fitted constants may not generalize beyond this corpus, but they seem reasonable given the information in the textbook and the output of the indexer.
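Rather than trying k values by hand, one can solve Heaps’ law for k directly given the observed vocabulary size: with b fixed, k = M / T^b. A small sketch of that back-fitting step, using the statistics above:

```python
def fit_k(vocab_size: int, tokens: int, b: float = 0.5) -> float:
    """Solve M = k * T^b for k, given observed vocabulary size M and token count T."""
    return vocab_size / tokens ** b

k = fit_k(4280, 47201)
print(round(k, 1))
# → 19.7, consistent with the k = 20 found by trial above
```

Note that this fixes b at 0.5; fitting both k and b properly would require vocabulary counts at several collection sizes (e.g. a log-log regression of M against T as the index grows).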

References

  • UoPeople (2023). CS3308: Information Retrieval. Unit 2: Written Assignment.
  • Manning, C.D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval (Online ed.). Cambridge University Press. Chapter 5: Index Compression. https://nlp.stanford.edu/IR-book/pdf/05comp.pdf