
3. Index Compression

Introduction 1

  • The CPU is much faster than the disk, so it sits idle while waiting for disk I/O.
  • Transferring data from disk to memory is slow.
  • Thus, transferring less data (compressed) and then decompressing it in memory is faster than transferring more data (uncompressed).
  • Many preprocessing steps are considered lossy: stemming, stop word removal, case folding, number removal, etc.
  • Power laws help estimate the space requirements of a given information retrieval system.
  • Unicode is replacing ASCII as the standard character encoding scheme.
    • ASCII: represents each character in 1 byte (8 bits).
    • Unicode: represents each character in 1–4 bytes (UTF-8), 2 or 4 bytes (UTF-16), or 4 bytes (UTF-32).
    • Nowadays, most systems use UTF-8 (see the encoding snippet after this list).
  • Heaps’ law: estimate the number of unique terms in a collection, i.e., the size of the collection’s vocabulary.
  • Zipf’s law:
    • Estimate the frequency of terms in a collection.
    • Each term’s frequency is smaller than the previous term’s by a roughly constant factor.
  • Compressing Dictionary: Dictionary-as-string:
    • Store the dictionary as one long string.
    • For each term we keep its frequency, a pointer to its postings list, and a pointer to the term’s position in the string (which also marks where the previous term ends).
    • Save up to 60% of the space.
  • Compressing Dictionary: Blocking:
    • Instead of storing pointers after every word, store pointers after every k words.
    • Saves the pointers space.
    • k is chosen to be small, so that scanning a block is cheap.
    • If the term falls within the located block, scanning the entire block to find it is not a problem.
    • If the term is not in that block, the search moves on to the next block pointer, and so on.
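
The per-character byte counts above can be checked directly; a minimal Python snippet (the sample strings are arbitrary):

```python
# Bytes per string under different encodings. ASCII characters take 1 byte
# in UTF-8, while accented characters take more; UTF-16/UTF-32 use 2/4 bytes
# per character here (the "-le" variants avoid counting a byte-order mark).
for text in ["index", "naïve"]:
    print(text,
          "utf-8:", len(text.encode("utf-8")),
          "utf-16:", len(text.encode("utf-16-le")),
          "utf-32:", len(text.encode("utf-32-le")))
```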

Index Compression 2

  • Compression ratios of 1:4 are easy to achieve, potentially cutting the cost of storing the index by 75%.

5.1 Statistical properties of terms in information retrieval

  • The rule of 30 states that the 30 most common words account for 30% of the tokens in written text.
  • Eliminating the 150 most common words from indexing cuts 25% to 30% of the nonpositional postings.
  • But, although a stop list of 150 words reduces the number of postings by a quarter or more, this size reduction does not carry over to the size of the compressed index.
  • Lossy compression:
    • Stemming, stop words, case folding, number removal, etc.
    • Vector space model, dimensionality reduction, latent semantic indexing, etc.
    • Useful in search engines because user queries are short and the user only cares about the first few results, so some postings that are far down the list can be discarded.
  • 5.1.1 Heaps’ law: Estimating the number of terms:
    • M = kT^b, where:
      • M is the number of distinct terms in the collection.
      • T is the number of tokens in the collection.
      • k and b are parameters that depend on the collection. 30 <= k <= 100, and 0.4 <= b <= 0.6.
    • The dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached.
    • The size of the dictionary is quite large for large collections (a quick numeric sketch follows this list).
  • 5.1.2 Zipf’s law: Modeling the distribution of terms:
    • Help us understand how terms are distributed across documents.
    • If the most frequent term occurs cf1 times, the second most frequent term has half as many occurrences, the third most frequent term a third as many, and so on (see the sketch after this list).
    • log cfi = log c + k log i where:
      • cfi is the frequency of the ith most frequent term.
      • c is a constant.
      • k is a constant, usually -1.
      • i is the rank of the term in the frequency table.
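
A quick numeric sketch of Heaps’ law. The parameters k = 44 and b = 0.49 are the values Manning et al. report for the Reuters-RCV1 collection; the token counts are only illustrative:

```python
# Heaps' law: M = k * T^b estimates the vocabulary size M from the token count T.
# k = 44 and b = 0.49 are the Reuters-RCV1 values from Manning et al.;
# the token counts below are illustrative.
k, b = 44, 0.49

for T in (10**6, 10**7, 10**8):
    M = k * T ** b
    print(f"T = {T:>11,} tokens  ->  M ~ {M:,.0f} distinct terms")
```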
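
And a sketch of Zipf’s law with k = -1: the ith most frequent term is expected to occur about cf1 / i times. The value of cf1 below is arbitrary:

```python
# Zipf's law: cf_i = c * i^k with k ~ -1, i.e. cf_i ~ cf_1 / i.
# cf_1 (collection frequency of the most frequent term) is an arbitrary example.
cf1 = 1_000_000

for i in (1, 2, 3, 10, 100):
    print(f"rank {i:>3}: expected collection frequency ~ {cf1 / i:,.0f}")
```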

5.2 Dictionary compression

  • The main goal of compressing the dictionary is to fit it in main memory, or at least a large portion of it, to support high query throughput.
  • Using fixed-width entries for terms is clearly wasteful, as the width has to fit the largest term, while most terms are much shorter than 20 bytes.
  • (20 + 4 + 4) = 28 bytes for each term in fixed-width arrays.
    • 20 bytes for the term itself.
    • 4 bytes for the frequency of the term.
    • 4 bytes for the pointer to the posting list.
  • 5.2.1 Dictionary as a string:
    • Saves up to 60% of the space compared to fixed-width arrays.
    • All terms are stored in a single string, not separated by any delimiters.
    • The dictionary table does not store the term itself; instead it stores a pointer to the term (its position in the string). A small sketch appears after this list.
    • (8 + 4 + 4 + 3) = 19 bytes per term with the dictionary-as-string scheme:
      • 8 bytes for the term itself (the average term length in the string).
      • 4 bytes for the frequency of the term.
      • 4 bytes for the pointer to the posting list.
      • 3 bytes for the term pointer.
  • 5.2.2 Blocked storage:
    • Instead of storing pointers for every term, store pointers for every k terms.
    • Usually k is chosen to be small, and k = 4 is a good choice.
    • Within a block, each term is prefixed by its length (1 byte), which acts as a delimiter so that terms can be extracted from the string.
    • (8 + 4 + 4 + ¾ + 1) = 17.75 bytes per term with blocked storage and k = 4 (see the sketch after this list):
      • 8 bytes for the term itself.
      • 4 bytes for the frequency of the term.
      • 4 bytes for the pointer to the postings list.
      • ¾ byte on average for the term pointer, since only one 3-byte term pointer is stored per block of k = 4 terms.
      • 1 byte for the term’s length, stored before every term in the dictionary string.
    • Increasing k reduces the number of term pointers and gives better compression, but it increases the time to find a term in the dictionary, as scanning a block takes longer.
    • Because the block must be scanned to locate the exact term, lookup takes more steps than in the dictionary-as-string method (roughly 25% more on average).
  • Front coding:
    • Consecutive entries in an alphabetically sorted list share common prefixes.
    • E.g. Automata, Automate, Automatic, Automation share the prefix Automat and these terms can be stored in blocked storage as 8automat*a1@e2@ic3@ion
    • Storing the common prefix with each term is wasteful; instead, we store the prefix once and then store only the rest of each term (a sketch of this encoding follows the list).
  • Other compression schemes:
    • Minimal perfect hashing:
      • A hash function that hashes every term to a unique integer from [0, N-1], where N is the number of terms in the dictionary.
      • Inserting a new term requires rehashing all the terms, so it is not practical for dynamic collections.
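
A minimal sketch of the dictionary-as-string idea in Python. The vocabulary, frequencies, and postings offsets are made up, and a real index would pack the table into fixed-width binary records rather than Python tuples:

```python
# Dictionary-as-string: all terms concatenated into one string, plus a table
# of (frequency, postings pointer, term pointer) per term. Terms, frequencies,
# and postings offsets below are made-up examples.
terms = ["automata", "automate", "automatic", "automation"]
freqs = [12, 7, 41, 5]               # document frequencies (made up)
postings_ptrs = [0, 96, 160, 480]    # offsets into a postings file (made up)

dictionary_string = ""
table = []                           # (freq, postings_ptr, term_ptr) per term
for term, freq, pptr in zip(terms, freqs, postings_ptrs):
    table.append((freq, pptr, len(dictionary_string)))   # term_ptr = start offset
    dictionary_string += term        # no delimiters between terms

def lookup(term):
    """Binary search over the table; a term's text runs from its own term
    pointer up to the next entry's term pointer (or the end of the string)."""
    lo, hi = 0, len(table) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = table[mid][2]
        end = table[mid + 1][2] if mid + 1 < len(table) else len(dictionary_string)
        candidate = dictionary_string[start:end]
        if candidate == term:
            return table[mid]
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

print(lookup("automatic"))           # -> (41, 160, 16)
```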
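
A sketch of blocked storage with k = 4. Each term in the string is preceded by its length (a single character stands in for the 1-byte length), and only one term pointer is kept per block; lookup first picks the block, then scans it linearly. The vocabulary is a made-up sorted list:

```python
# Blocked storage with k = 4: one term pointer per block of k terms; inside the
# string each term is preceded by a 1-"byte" length (chr(len) here).
k = 4
terms = sorted(["auto", "automata", "automate", "automatic",
                "automation", "axle", "azure", "badge"])  # made-up vocabulary

blocked_string = ""
block_ptrs = []                       # one string offset per block of k terms
for i, term in enumerate(terms):
    if i % k == 0:
        block_ptrs.append(len(blocked_string))
    blocked_string += chr(len(term)) + term   # length prefix acts as delimiter

def first_term(block):
    pos = block_ptrs[block]
    length = ord(blocked_string[pos])
    return blocked_string[pos + 1: pos + 1 + length]

def lookup(target):
    # Pick the last block whose first term is <= target (a real index would
    # binary-search the block pointers), then scan that block linearly.
    block = 0
    for b in range(len(block_ptrs)):
        if first_term(b) <= target:
            block = b
    pos = block_ptrs[block]
    end = block_ptrs[block + 1] if block + 1 < len(block_ptrs) else len(blocked_string)
    index = block * k                 # index into the per-term freq/postings tables
    while pos < end:
        length = ord(blocked_string[pos])
        if blocked_string[pos + 1: pos + 1 + length] == target:
            return index
        pos += 1 + length
        index += 1
    return None

print(lookup("axle"))                 # -> 5 (its position in the sorted vocabulary)
```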
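
And a sketch of front coding for one block, reproducing the 8automat*a1@e2@ic3@ion layout from the example above (the * and @ markers follow the notes; real implementations choose their own byte layout):

```python
import os

def front_code_block(terms):
    """Front-code one block of alphabetically adjacent terms as
    <len of first term><common prefix>*<rest of first term>
    followed by <suffix length>@<suffix> for every other term."""
    prefix = os.path.commonprefix(terms)
    out = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    for term in terms[1:]:
        suffix = term[len(prefix):]
        out += f"{len(suffix)}@{suffix}"
    return out

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# -> 8automat*a1@e2@ic3@ion
```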

5.3 Postings file compression 3

  • Storing Gaps 4:
    • Instead of storing large docIDs directly, we store the first docID, then the difference needed to get to the next docID (the gap), and so on.
    • So to store 33, 47, 154, 159, 202, we store the gaps 33, 14, 107, 5, 43.
    • The gaps are smaller numbers, so they take less space.
    • To retrieve the docID 154, we start with 33 and add the gaps (14, then 107) until we reach 154 (see the sketch after this list).
  • Variable-length encoding uses fewer bytes to store small numbers and more bytes to store large numbers, rather than using a fixed number of bytes large enough for all numbers.
  • To encode small numbers in less space than large numbers:
    • Two types of methods: bytewise compression and bitwise compression.
    • As the names suggest, these methods attempt to encode gaps with the minimum number of bytes and bits, respectively.
  • 5.3.1 Variable byte codes:
    • Each byte consists of 8 bits:
      • Continuation bit:
        • The first bit.
        • 1 if the byte is the last byte of the gap, 0 if not.
        • It acts as a flag indicating whether the byte is the last byte of the gap.
      • Payload:
        • The last 7 bits of each byte.
        • They store the gap or part of the gap.
    • We start reading the first byte and continue until we reach a byte with continuation bit = 1; that’s where the number ends.
    • For most IR systems, variable byte codes offer an excellent trade-off between time and space, and they are simple to implement (see the sketch after this list).
    • Nibble: half a byte, 4 bits.
    • Word: 32 bits, or 4 bytes, or 8 nibbles.
  • 5.3.2 γ codes 3:
    • Unary Code:
      • Encode a number n as n 1s followed by a 0.
      • E.g. 5 is encoded as 111110 (5 1s followed by a 0).
      • E.g. 10 is encoded as 11111111110 (10 1s followed by a 0).
    • γ codes (Gamma code)
      • Implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
      • The offset is the binary representation of G, with the most significant bit removed.
      • The length is the unary representation of the number of bits in the offset.
      • Gamma code: unary(length(offset)) offset
      • E.g. G=13.
        • Binary representation of G: 1101.
        • Offset: 101 (remove the most significant bit which is 1).
        • Length: 3 -> 1110 (unary representation of 3, 3 1s followed by zero).
        • Gamma code: 1110,101 (length + offset, comma is for readability).
      • The length of the gamma code is 2 × floor(log2(G)) + 1 bits.
      • We start reading 1s until we reach a 0 (that is where the length ends); then we read the offset according to the length.
      • How to read a stored gamma code, e.g. 1110,101:
        • We read 1s until we reach a 0, giving 1110, which represents 3.
        • We then read the next 3 bits, 101; that is the end of the gamma code.
        • The original number is restored by prepending a 1 to the offset: 1,101 -> 1101 -> 13 (see the sketches after this list).
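
A minimal sketch of gap encoding and decoding for the posting list used above:

```python
# Convert a sorted list of docIDs to gaps, and sum gaps back into docIDs.
def to_gaps(doc_ids):
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps):
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

print(to_gaps([33, 47, 154, 159, 202]))    # [33, 14, 107, 5, 43]
print(from_gaps([33, 14, 107, 5, 43]))     # [33, 47, 154, 159, 202]
```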
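
A sketch of variable byte coding as described above: 7-bit payloads, with the continuation bit set to 1 only on the last byte of each gap (the convention used in Manning et al.):

```python
def vb_encode_number(n):
    """Variable byte code: 7-bit payloads; the high bit is 1 on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)       # prepend the low 7 bits
        if n < 128:
            break
        n //= 128
    out[-1] += 128                   # set the continuation bit on the last byte
    return bytes(out)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:               # continuation bit 0: more bytes follow
            n = n * 128 + byte
        else:                        # continuation bit 1: last byte of this gap
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

encoded = b"".join(vb_encode_number(g) for g in [33, 14, 107, 5, 43])
print(list(encoded))                 # each of these small gaps fits in one byte
print(vb_decode(encoded))            # [33, 14, 107, 5, 43]
```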
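
Finally, a sketch of the γ code using bit strings for readability: unary(length of offset) followed by the offset (the binary representation of G without its leading 1); decoding reverses the steps listed above:

```python
def gamma_encode(g):
    """Gamma code of a gap g >= 1 as a bit string: unary(len(offset)) + offset."""
    offset = bin(g)[3:]                  # binary of g without its leading '1' ('' for g = 1)
    length = "1" * len(offset) + "0"     # unary code of the offset's length
    return length + offset

def gamma_decode(bits):
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":            # read the unary part: count 1s up to the 0
            length += 1
            i += 1
        i += 1                           # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        numbers.append(int("1" + offset, 2))   # restore the removed leading 1
    return numbers

print(gamma_encode(13))                                    # 1110101 (length 1110, offset 101)
print(gamma_decode(gamma_encode(13) + gamma_encode(5)))    # [13, 5]
```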

References


  1. UoPeople (2020) CS3308: Information Retrieval. Unit 3: Introduction & Video lectures. https://my.uopeople.edu/course/view.php?id=7253#section-3 

  2. Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. Chapter 5: Index Compression. https://nlp.stanford.edu/IR-book/pdf/05comp.pdf 

  3. Oresoft (2014). Web Data Mining. YouTube. https://www.youtube.com/playlist?list=PL0174E49C0E0DD5C8 

  4. Venkatesh Vinayakarao. (2022). Information Retrieval. YouTube. https://www.youtube.com/playlist?list=PLg38siQiGwWHdLUMo_UIPlYTh5cxTWRSB. Video 9: IR Course Lecture 9: Index Compression. https://www.youtube.com/watch?v=SU5a4wkWdHg&list=PLg38siQiGwWHdLUMo_UIPlYTh5cxTWRSB&index=9