Skip to content

DA8. Zipf Distribution

Statement

Describe in your own words the zipf distribution, how it functions, and provide an example of where it can be used.

Solution

  • Zipf distribution is a discrete probability distribution that is used to model the frequency of occurrence of elements in a dataset; it was originated in 1935 and named after the American linguist George Kingsley Zipf (TechTarget, n.d.).
  • The distribution uses the rank term, which is the position of an element in a dataset when it is sorted in descending order of frequency, so that the most frequent element has rank 1, the second most frequent element has rank 2, etc (TechTarget, n.d.).
  • In lyman terms, There are a relationship between the rank of the element (its position) and its frequency of occurrence (how many times it appears in the dataset), where the frequency of occurrence of an element is inversely proportional to its rank and falls sharply as rank increases (TechTarget, n.d.).
  • the Zipf distribution applies to wide variety of data sets, including language texts, some population statistics, and even the distribution of Covid-19 cases (Boka et al., 2021).
  • In data structures, the Zipf distribution appears in self-organizing lists, Hash tables, and other Indexing techniques.
  • The text mentions that the Zipf distribution is used in self-organizing lists in the context of 80/20 rule, where -in practice- 20% of the elements are accessed 80% of the time (Shaffer, 2011).
  • This is helpful in supporting sequential search by re-arranging the list to put the most frequently accesses 20% of elements at the beginning of the list, so that the search will be faster. But if we don’t know access patterns in advance this would have little use (Shaffer, 2011).

References