JA3. Learning Journal 3

Statement

Your learning journal entry must be a reflective statement that considers the following questions:

1. Describe what you did

This was the third week of the course, and it built on the previous weeks. I started on Sunday with the reading assignment and lecture videos; later I did the self-quiz, then the discussion assignment, and now I am writing this journal entry.

2. Describe your reactions to what you did

Compression is an interesting topic, especially when you have large data sets or corpora. The techniques used to compress the index are very smart, and sometimes a small change can have a huge effect on both the size and the performance of the index.

3. Describe any feedback you received or any specific interactions you had. Discuss how they were helpful

I did not receive any meaningful feedback that I can discuss here.

4. Describe your feelings and attitudes

I feel that I am more confident in understanding the benefits of compression, in-memory/disk indexing, one-pass vs multi-pass indexing, and the trade-offs between them. It is fascinating to see how smart people build algorithms that move data to/from memory/disk in a way that optimizes the performance of the index.

5. Describe what you learned

I started with the lecture and introduction notes (UoPeople, 2023), which covered the power laws used to estimate both the number of terms and the frequencies of those terms, and then moved on to compression techniques such as dictionary-as-string and blocked storage. However, I did not understand anything from the videos, so I moved to the next resource.
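To make the power laws more concrete for myself, here is a minimal sketch of the two estimates as I understand them from Manning et al. (2009): Heaps' law for vocabulary size (M = k * T^b) and Zipf's law for term frequency (frequency proportional to 1/rank). The function names are my own, and the constants k = 44 and b = 0.49 are the values the book reports for the Reuters-RCV1 collection; they are assumptions that would need to be refit for any other corpus.

```python
def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, b: float = 0.49) -> float:
    """Estimate the number of distinct terms M = k * T^b (Heaps' law).

    k and b are corpus-dependent; the defaults are the Reuters-RCV1
    values from Manning et al. (2009) and are only an assumption here.
    """
    return k * (num_tokens ** b)


def zipf_frequency(rank: int, top_frequency: float) -> float:
    """Estimate the collection frequency of the term at a given rank.

    Zipf's law says frequency is proportional to 1 / rank, so the
    i-th most common term occurs about top_frequency / i times.
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_frequency / rank


if __name__ == "__main__":
    # Rough vocabulary estimate for about one million tokens.
    print(round(heaps_vocabulary_size(1_000_020)))
    # If the most common term occurs 60,000 times, estimate rank 3.
    print(zipf_frequency(3, 60_000))
```

Both formulas are rough fits, so the results should be read as ballpark estimates that help plan storage before building the index.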

(Manning et al., 2009) was the next resource. It covered the rule of 30, lossy and lossless compression, dictionary compression (dictionary-as-string, blocked storage, front coding, and minimal perfect hashing), and postings list compression (variable byte encoding and gamma codes). Postings list compression was too complex for me to understand from the book, so I searched for more resources and found the next two useful ones.

(Oresoft, 2014) clearly explained gamma codes: how to derive the offset (the binary form of the number without its leading 1), how to write the length of that offset in unary, and how to build the gamma code by concatenating the unary length with the offset; it also showed how to decode a gamma code back to the original number. The main idea is to use a variable number of bits per number, so that each number takes only about as much space as it needs.
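The steps above can be sketched in a few lines. This is only my own illustration, not code from the video or the book: it represents bits as a string of '0' and '1' characters for readability (a real index would pack them into bytes), and the function names are hypothetical.

```python
def gamma_encode(n: int) -> str:
    """Return the gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]                # e.g. 13 -> '1101'
    offset = binary[1:]                # drop the leading 1 -> '101'
    length = "1" * len(offset) + "0"   # unary code of len(offset) -> '1110'
    return length + offset             # '1110' + '101'


def gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes back into integers."""
    numbers = []
    i = 0
    while i < len(bits):
        # Read the unary part: count 1s up to the terminating 0.
        length = 0
        while bits[i] == "1":
            length += 1
            i += 1
        i += 1  # skip the terminating 0
        # Read `length` offset bits and restore the leading 1.
        offset = bits[i:i + length]
        i += length
        numbers.append(int("1" + offset, 2))
    return numbers


if __name__ == "__main__":
    gaps = [1, 13, 9]
    stream = "".join(gamma_encode(g) for g in gaps)
    print(stream)
    print(gamma_decode(stream))
```

Because the unary part says exactly how many offset bits follow, the codes can be written back to back with no separators and still be decoded unambiguously.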

(Venkatesh Vinayakarao, 2022) explained variable byte encoding, which stores each gap in as few whole bytes as it needs, with little wasted space; each byte carries 7 bits of payload, and its first bit is a flag that indicates whether the number ends with this byte or continues into the next one.
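Here is a minimal sketch of variable byte encoding as I understood it. It assumes the convention from Manning et al. (2009), where the flag bit is 1 on the last byte of a number and 0 on every earlier byte; other implementations flip that convention. The function names are my own.

```python
def vb_encode_number(n: int) -> bytes:
    """Encode one non-negative integer with variable byte encoding."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    chunks = []
    while True:
        chunks.insert(0, n % 128)   # take the lowest 7 bits as payload
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128               # set the flag bit on the last byte
    return bytes(chunks)


def vb_encode(numbers: list[int]) -> bytes:
    """Encode a list of gaps as one byte stream."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(stream: bytes) -> list[int]:
    """Decode a variable byte stream back into integers."""
    numbers = []
    n = 0
    for byte in stream:
        if byte < 128:
            # Flag bit is 0: more bytes follow for this number.
            n = 128 * n + byte
        else:
            # Flag bit is 1: this byte finishes the number.
            n = 128 * n + (byte - 128)
            numbers.append(n)
            n = 0
    return numbers


if __name__ == "__main__":
    gaps = [824, 5, 214577]
    encoded = vb_encode(gaps)
    print(" ".join(f"{b:08b}" for b in encoded))
    print(vb_decode(encoded))
```

Small gaps fit in a single byte while large gaps take more, which is where the savings over a fixed 4-byte integer come from.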

6. What surprised me or caused me to wonder?

I was surprised by the results that my classmates got in the previous programming assignment. We all started with the same corpus and the same code, but the results were very different. Even more surprising, the results fell into a few groups: for example, some students got 4280 terms, just like me, while another group got around 5500 unique terms.

7. What happened that felt particularly challenging? Why was it challenging for me?

The gamma codes (variable bit encoding) were very challenging; I had to watch many videos to understand them, and I think I have already forgotten the details, because the numbers go through a complex conversion before they are stored and again when they are read back. (Oresoft, 2014) suggested that this method is not the most efficient one, so I did not spend much time on it.

8. What skills and knowledge do I recognize that I am gaining?

I can now understand the trade-offs between the different compression techniques, and the difference between storing data in the default fixed-length formats and using variable byte or variable bit encoding. I also learned how to estimate the space and memory requirements of indexes and postings lists even before building them.

9. What am I realizing about myself as a learner?

I’m interested in the topic of compression, as I think all good products have an advantage over the bad ones because they manage the resources better; and compression is one of the ways to get the most out of storage/memory and thus improve the performance of the system.

10. In what ways am I able to apply the ideas and concepts gained to my own experience?

I’m interested in the topic of databases; DBMSs are the standard in my mind against which I compare everything I learn. Learning about indexing and compression made me think about how DBMSs store raw data and retrieve it later in a very efficient and fast way.

References