Skip to content

JA7. File Processing and External Sorting

Statement

The Learning Journal is a tool for self-reflection on the learning process. In addition to completing directed tasks, you should use the Learning Journal to document your activities, record problems you may have encountered and to draft answers for Discussion Forums and Assignments.

Your learning journal entry must be a reflective statement that considers the following questions

1. Describe what you did. This does not mean that you copy and paste from what you have posted or the assignments you have prepared. You need to describe what you did and how you did it

This was the 7th week of the course; I started on Sunday reading the chapter 8 from the Shaffer book, then I tried the self-quiz and solved the discussion forum question. There was no assignment this week. The week was about file processing and external sorting.

External sorting is sorting very large set of data that can not all fit in memory at once, so that this data must brought from disk in portions, sorted and then saved back to the disk; at the end theses portions must be treated (merged) to produce the final sorted data.

File processing is the operations that applications (through system calls to the operating system) can do on files, like reading, writing, appending, deleting, renaming, etc. These operations are governed at both ends of the disk (through disk drivers) and the application (through the operating system).

2. Describe your reactions to what you did

I learned about the disks, disk drivers, and file systems in previous courses but I did not fully understood them as there are so many details; this week builds up upon my previous knowledge and I learned more about the file systems and how they work.

The external sorting is a very interesting topic, I did not know that it is possible to sort data that is larger than the memory, and I did not know that there are so many algorithms to do that; which caused me to wonder how DBMSs had gave us the impression that the issue is simple where we simply got sorted data for granted.

3. Describe any feedback you received or any specific interactions you had. Discuss how they were helpful

The discussion assignment asked us to validate the some file processing operations related to buffering of pages of data from disk into memory. The task is to use a run a code snippet that is doing buffering in multiple technique (heuristics) and compare each technique in terms of the number of disk accesses and performance.

I did not talk about page faults in my answer, but my classmates mentioned in their feedback (on various answers) that when application is trying to access a page that does not exist on the memory buffer of the file, it will rise a page fault interrupt where the OS interrupt handler response will be to load the page from the disk into the memory buffer. This is a very important point that I missed in my answer.

4. Describe your feelings and attitudes

I am feeling good about my understanding to disks, disk drivers, and file systems; I am also happy that I finally got a clear understanding of the asymptotic analysis in general and the difference between the big-O, big-omega, and big-theta notations in particular.

5. Describe what you learned

The lesson starts with an introduction to memory, and the difference between the primary memory (RAM) and the secondary memory (disks), then it goes to explain the disk drive architecture and the cost of a accessing a block of data from the disk, then it explains the file systems and their terminology and the differences between the windows-based and the UNIX-based file systems.

The second part of the chapter talks about buffering of disk blocks into main memory; where the interface of a buffer pool is discussed, then two ways of implementing these buffers are discussed: the buffer pool pointer and the improved buffer pool implementation. Three heuristics for buffer replacement are discussed: the least recently used (LRU), the least frequently used (LFU), and the first in first out (FIFO). The chapter goes on with a discussion about these heuristics and then the programmer view of a file.

The last section of the chapter speaks about the external sorting algorithms where it starts with a definition of the problem, then it discusses the use of usual quick sort or merge sort, then the text mentions the Replacement selection and its way of multi-way merging, that results in a sorted file.

6. What surprised me or caused me to wonder?

I was surprised how the DBMSs systems gave us the records sorted for granted even for very large set of databases that can not fit in memory; I was also surprised that there are so many algorithms to do external sorting.

I was also surprised how small changes in the memory replacement heuristics can have a huge impact on the performance of the application and the number of disk accesses; and how each situation is unique and there is not solution that fits all.

7. What happened that felt particularly challenging? Why was it challenging to me?

The external sorting algorithms are very challenging to me, as it is my first time reading about them; sorting in general is a huge topic and I am not sure if I will be able to fully understand it in the future.

8. What skills and knowledge do I recognize that I am gaining?

I took operating system courses before (memory and disks) and databases (manipulation of huge data sets that are larger than memory), but I did not understand the details of theses topics; I feel that I know more about what page fault is and what causing it, how the file system talks to the disk driver, and how Database Management Systems (DBMSs) are able to sort data that is larger than memory.

9. What am I realizing about myself as a learner?

I am a bit cool about things that I do not fully understand as I’m certain next courses will cover tha gaps.

10. In what ways am I able to apply the ideas and concepts gained to my own experience?

We have a project at work that is about sorting a set of data larger than memory, we use unstructured schema-less data No-SQL database, that sort data by primary keys; however, business is requiring data to be sorted on different keys.

11. Finally, describe one important thing that you are thinking about in relation to the activity

I am thinking about the internal structure of database management systems and how they are able to sort data that is larger than memory, and what algorithms they use to do that.

References