
5. Persistence 1

  • Making information persist, despite computer crashes, disk failures, or power outages, is a tough and interesting challenge.

36. I/O Devices

Canonical Device

  • Consists of an interface and a controller (its internal structure).
  • The interface is unified (standardized), but the internal structure varies from device to device.


The Canonical Protocol

  • The device interface has 3 registers: STATUS, COMMAND, and DATA.
    • Status register: contains the status of the device.
    • Command register: contains the command to be executed by the device.
    • Data register: contains the data to be read or written by the device (argument or result of the command).
  • The CPU keeps polling the device to see if it is ready to accept a command or if it has finished executing one.
  • Polling is a waste of CPU cycles, but it is simple and reliable.
```
While (STATUS == BUSY)
    ;                            // wait until device is not busy
Write data to DATA register
Write command to COMMAND register
    (Doing so starts the device and executes the command)
While (STATUS == BUSY)
    ;                            // wait until device is done with your request
```

Lowering CPU Overhead With Interrupts

  • Replace polling with interrupts.
  • The OS puts the issuing process to sleep until the I/O device has finished its work. When the device is done, it raises a hardware interrupt to the CPU, which causes the OS to jump into a pre-configured interrupt handler (Interrupt Service Routine, ISR). The handler reads the device's status and responds accordingly.
  • Interrupts thus allow for overlapping of computation and I/O.
  • This is not always the most efficient solution. If the I/O device is very fast, then by the time the CPU has context-switched to another process, the device is already done; the interrupt fires and the OS context-switches right back to the original process. In that case the interrupt machinery only adds overhead, and polling is the more efficient choice.
  • If the I/O device is slow, use interrupts. The CPU will be able to do other work while the I/O device is busy.
  • If the I/O device is fast, use polling. No point in context-switching since the I/O device will return quickly.
  • If the I/O device speed is unknown, use a hybrid: poll for a while, and if the device is still busy, switch to interrupts.
  • Be careful with interrupts in network I/O; if the CPU has to handle an interrupt for every arriving packet, it can be overwhelmed and never execute any user code. This issue is called livelock.
  • Another optimization is interrupt coalescing: the device raises a single interrupt for a batch of completed operations (e.g., one interrupt per 1000 completions) rather than one per operation. This reduces the number of interrupts and the overhead of handling them, but adds latency to individual requests.
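The hybrid poll-then-interrupt idea above can be sketched in Python; this is an illustrative model only, where `threading.Event` stands in for the device's STATUS register and completion interrupt:

```python
import threading

def wait_for_device(device_done: threading.Event, poll_budget: int = 1000) -> str:
    # Phase 1: spin-poll for a bounded number of iterations; cheap if the
    # device turns out to be fast.
    for _ in range(poll_budget):
        if device_done.is_set():
            return "completed while polling"
    # Phase 2: give up polling and sleep until the completion "interrupt",
    # freeing the CPU for other work.
    device_done.wait()
    return "completed after blocking"

# Simulate a slow device: it finishes 50 ms after the request is issued,
# so the poll budget runs out and we fall back to blocking.
done = threading.Event()
threading.Timer(0.05, done.set).start()
result = wait_for_device(done)
print(result)
```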

Data Movement With DMA

  • DMA (Direct Memory Access) is a device within the system that is responsible for transferring data between devices and main memory without the intervention of the CPU.
  • OS would start the DMA engine by providing the following:
    • Where the data lives in memory
    • How much data to transfer
    • Which device to transfer the data to
  • The DMA engine would then transfer the data from the device to memory or vice versa, without the CPU’s intervention.
  • During this time, the CPU can do other work.
  • When DMA is done, it raises an interrupt to the CPU, which can then handle the interrupt and continue with its work.
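As a rough sketch (the class and method names are illustrative, not a real driver API), the three parameters the OS hands to the DMA engine can be modeled like this:

```python
class DMAEngine:
    """Toy model of a DMA engine: programmed with (address, length, device)."""

    def __init__(self):
        self.addr = self.length = self.device = None

    def program(self, addr: int, length: int, device: str) -> None:
        # The OS writes these three values into the engine's registers.
        self.addr, self.length, self.device = addr, length, device

    def run(self, memory: bytearray) -> bytes:
        # Stand-in for the hardware copy; returns the bytes "sent" to the
        # device, which in real hardware happens without CPU involvement.
        return bytes(memory[self.addr:self.addr + self.length])

memory = bytearray(b"hello, device!")
dma = DMAEngine()
dma.program(addr=0, length=5, device="disk0")
payload = dma.run(memory)
print(payload)  # b'hello'
```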

How the OS communicates with the device

  • The OS uses two methods to communicate with devices:
    • Explicit I/O instructions:
      • OS sends data to specific device registers.
      • This is a privileged operation; only the OS is allowed to communicate with these devices.
    • Memory-mapped I/O:
      • The hardware makes the device's registers appear at specific addresses in the address space.
      • The OS issues ordinary load and store instructions to those addresses, and the hardware routes them to the device instead of main memory.
      • These addresses are still protected by the OS; regular processes cannot access them directly.

Device Drivers

  • Device drivers are special programs that are responsible for communicating with devices.
  • The driver becomes part of the OS kernel; it runs in kernel mode as an abstraction layer between the OS and the device.



37. Hard Disk Drives

  • The drive consists of a large number of sectors (512-byte blocks), each of which can be read or written.
  • We can view the disk as an array of sectors; 0 to n − 1; thus the address space of the drive.
  • Multi-sector operations are possible; indeed, many file systems will read or write 4KB at a time (or more).
  • Only a single-sector update is atomic; a larger write interrupted by a power loss may complete only partially (a torn write).
  • Disk Cache (a.k.a. Track Buffer): a small amount of memory (10-16 MB) in the disk controller used to store recently read or written sectors.

Hard Disk components

  • Platters. Aluminum disk surfaces with a magnetic coating.
  • Spindle. The central shaft that holds the platters; it rotates constantly at a rate measured in RPM (rotations per minute), with typical speeds of 7,200-15,000 RPM.
  • Track. A concentric circle of sectors on a platter surface; the read/write head is positioned over one track at a time.
  • Disk Arm + Head. The arm swings across the platter surface to position the head over the desired track; the head, attached to the end of the arm, performs the actual reads and writes.

I/O Time

  • Seek Time. When issuing a read or write command, the disk arm must move to the correct track. This delay is called seek time.
  • Rotational Delay. After the disk arm has moved to the correct track, the head must wait for the correct sector to rotate under it; this is called rotational delay (also known as rotational latency).
  • Seek. is the process of moving the disk arm to the correct track. The seek consists of several phases: acceleration, coasting, deceleration, and finally settling.
  • Settling Time. The time spent after the arm reaches a track to carefully position the head over it; this time is significant, typically 0.5-2 ms.
  • Transfer Time. is the time it takes to read or write a sector.
  • The total time for an I/O operation is the sum of the seek time, rotational delay, and transfer time.

Write and Workload Types in Hard Disks

  • Two types of writing to the disk:
    • Write Back (a.k.a. Immediate Reporting). Faster: the disk caches the update (writing only to its cache memory) and reports it as completed, while the actual media update happens later in the background.
    • Write Through. The disk updates the sector on the media immediately and only then reports the write as completed.
  • Two types of workloads (the access patterns the disk serves):
    • Random Workload. the disk is used for small, random reads and writes. The disk arm must move to the correct track for each read or write. Common in DBMS.
    • Sequential Workload. the disk is used for large, sequential reads and writes. The disk arm can move to the correct track once and then read or write multiple sectors in a row.
  • Sequential workloads achieve far higher transfer rates than random ones; disks are therefore at their best when access is sequential.

I/O OPs analysis

  • Total I/O Time: T_I/O = T_seek + T_rotation + T_transfer
  • I/O Rate: R_I/O = Size_transfer / T_I/O
  • Examples of Hard Disk specs are shown below:
  • (figure: Hard Disk Specs)
  • To analyze the disks in the image:
    • T_seek is already reported by the manufacturer.
    • T_rotation can be derived from the RPM: a full rotation takes 1/RPM minutes (convert to ms), and the average rotational delay is half a rotation.
    • T_transfer can be derived from the transfer rate: T_transfer = Size_transfer / Rate_transfer
  • From the image above:
    • The Cheetah has a higher RPM and a larger cache; thus it achieves a higher I/O rate.
    • The Barracuda focuses on capacity; thus it stores more data but is slower.
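The formulas above can be checked numerically. The drive specs used here are assumptions loosely modeled on a typical 15,000 RPM drive (Cheetah-class) and a 7,200 RPM drive (Barracuda-class), not datasheet values:

```python
def io_time_ms(seek_ms: float, rpm: int, transfer_mb_s: float, size_kb: float) -> float:
    rotation_ms = (60_000 / rpm) / 2                    # average wait: half a rotation
    transfer_ms = (size_kb / 1024) / transfer_mb_s * 1000
    return seek_ms + rotation_ms + transfer_ms

def io_rate_mb_s(seek_ms: float, rpm: int, transfer_mb_s: float, size_kb: float) -> float:
    # R_I/O = Size_transfer / T_I/O
    return (size_kb / 1024) / (io_time_ms(seek_ms, rpm, transfer_mb_s, size_kb) / 1000)

# Random workload: one 4 KB read (dominated by seek + rotation).
cheetah = io_rate_mb_s(seek_ms=4, rpm=15000, transfer_mb_s=125, size_kb=4)
barracuda = io_rate_mb_s(seek_ms=9, rpm=7200, transfer_mb_s=105, size_kb=4)
print(f"Cheetah-class   random 4KB: {cheetah:.2f} MB/s")
print(f"Barracuda-class random 4KB: {barracuda:.2f} MB/s")
```

The striking result is that random 4 KB I/O delivers well under 1 MB/s on either drive, orders of magnitude below the peak sequential transfer rate.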

Disk Scheduling

  • The Disk Scheduler is a software component that decides which request to service next; its goal is to order read/write requests so as to minimize total I/O time.
  • A classic approach is Shortest Seek Time First (SSTF), a greedy algorithm similar to Shortest Job First (SJF): always service the pending request closest to the current head position.
  • The scheduler can estimate the positioning time for each request, since the current head location and the location of each request are known.
  • Problems with SSTF:
    • The OS does not know the disk's internal geometry; it approximates SSTF with Nearest-Block-First (NBF), scheduling the request with the nearest block address.
    • Starvation: requests for far-away tracks may never be serviced if closer requests keep arriving; solved by the Elevator algorithm (a.k.a. SCAN or C-SCAN).
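A toy SSTF scheduler (track numbers and the request list are made up) shows the greedy nearest-track choice:

```python
def sstf(head: int, requests: list[int]) -> list[int]:
    """Greedily service the pending request on the track nearest the head."""
    pending, order = list(requests), []
    while pending:
        nxt = min(pending, key=lambda t: abs(t - head))  # closest track wins
        pending.remove(nxt)
        order.append(nxt)
        head = nxt                                        # head moves there
    return order

print(sstf(head=50, requests=[10, 95, 52, 180, 47]))  # [52, 47, 10, 95, 180]
```

Note how the request at track 180 is served last; if new near-head requests kept arriving, it could starve indefinitely, which is exactly the problem the elevator algorithm addresses.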

Elevator Algorithm (SCAN)

  • A Sweep is a single pass of the disk arm from one end of the disk to the other (passing over all tracks).
  • The scheduler examines the incoming requests:
    • If the request is in the same direction as the current sweep (the current sweep has not passed over the track yet), it is serviced immediately (queued for the current sweep).
    • If the request references a block belonging to a track that has already been passed over, it is queued until the next sweep.
  • Variants of the algorithm:
    • F-SCAN: the requests that come during a sweep are all queued for the next sweep. The scheduler analyzes the pending queues only between sweeps. Avoids starvation of far-away requests.
    • C-SCAN: the disk arm services requests in only one direction, from inner to outer tracks; once it reaches the outer edge, it jumps back to the innermost track and sweeps outward again. This is fairer than plain SCAN, which favors the middle tracks.
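Both sweep policies can be sketched for a single batch of requests (illustrative track numbers; real schedulers also handle requests arriving mid-sweep):

```python
def scan(head: int, requests: list[int]) -> list[int]:
    """Elevator/SCAN: sweep toward higher tracks, then reverse direction."""
    up = sorted(t for t in requests if t >= head)
    down = sorted((t for t in requests if t < head), reverse=True)
    return up + down

def c_scan(head: int, requests: list[int]) -> list[int]:
    """C-SCAN: sweep up only; jump back to the start and sweep up again."""
    up = sorted(t for t in requests if t >= head)
    wrapped = sorted(t for t in requests if t < head)
    return up + wrapped

reqs = [10, 95, 52, 180, 47]
print(scan(50, reqs))    # [52, 95, 180, 47, 10]
print(c_scan(50, reqs))  # [52, 95, 180, 10, 47]
```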

SPTF: Shortest Positioning Time First

  • Sometimes called shortest access time first or SATF.
  • Which request to service next depends on the relative cost of seeking versus rotation:
    • If seeks are much faster than rotation, it can pay to service a request on a farther track whose sector will pass under the head sooner.
    • If rotation is much faster than seeks, service the request on the nearest track, since waiting for the right sector to come around is cheap.

I/O merging

  • When the OS issues multiple requests to the disk, the scheduler should look through them and merge requests that target adjacent or consecutive sectors into a single larger request.
  • The Disk Scheduler is responsible for merging requests.
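A minimal sketch of merging, assuming requests are identified by sector number and combined into (start, length) runs:

```python
def merge_requests(sectors: list[int]) -> list[tuple[int, int]]:
    """Coalesce sector requests into (start_sector, run_length) operations."""
    merged: list[tuple[int, int]] = []
    for s in sorted(set(sectors)):
        if merged and s == merged[-1][0] + merged[-1][1]:
            merged[-1] = (merged[-1][0], merged[-1][1] + 1)  # extend current run
        else:
            merged.append((s, 1))                            # start a new run
    return merged

# Seven scattered requests collapse into three disk operations.
print(merge_requests([33, 8, 34, 9, 10, 35, 100]))  # [(8, 3), (33, 3), (100, 1)]
```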

38. Redundant Arrays of Inexpensive Disks (RAIDs)

  • Internally, the RAID is a complex beast, consisting of multiple disks, memory (both volatile and non-), and one or more processors to manage the system.
  • A hardware RAID is very much like a computer system, specialized for the task of managing a group of disks.
  • Advantages of RAID:
    • Performance: more disks in parallel means more I/O operations per second.
    • Capacity: more disks means more storage.
    • Reliability: if one disk fails, the others can continue to operate.
    • Transparency: the user sees a single disk, not a group of disks.
    • Deployment: because a RAID presents itself as a single disk, it can replace one transparently; no hardware or software changes are required.
  • RAID system components:
    • Standard connection to the host computer (SATA, SCSI, etc.)
    • MicroController that runs firmware to direct the RAID operations.
    • Volatile memory (DRAM). to buffer data blocks as they are read or written.
    • Non-volatile memory. to safely buffer write operations.
  • Evaluating RAID performance:
    • Single-request latency. The time it takes to service a single request.
    • Steady-state throughput. the total bandwidth of all concurrent requests.

RAID Level 0: Striping

  • No redundancy, just striping.
  • Blocks are spread round-robin across the disks of the array; the blocks in the same row across all disks form a stripe.
  • The chunk size is the amount of data placed on one disk before moving on to the next (one or more blocks per chunk).
  • Effects of chunk size:
    • smaller chunk sizes mean that files are more likely to get striped across multiple disks, thus more parallelism in reading and writing; on the other hand, positioning time to access blocks across multiple disks increases.
    • larger chunk sizes mean that files are more likely to be stored on a single disk, thus less parallelism in reading and writing; on the other hand, positioning time to access blocks on a single disk decreases.
  • Striping is good for capacity and performance, but BAD for reliability.
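With a chunk size of one block, the striping layout reduces to a simple address mapping (a common textbook layout; real arrays vary):

```python
def raid0_map(block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block to (disk, offset) for chunk size = 1 block."""
    return block % num_disks, block // num_disks

# On a 4-disk array, blocks 0..3 form stripe 0 (one block per disk)
# and blocks 4..7 form stripe 1.
for b in range(8):
    print(b, raid0_map(b, num_disks=4))
```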

RAID Level 1: Mirroring

  • Mirroring is a redundancy technique that duplicates data on multiple disks.
  • Each block is written to multiple disks to tolerate disk failures.
  • Types of mirroring:
    • RAID 10 (stripe of mirrors). Each pair of adjacent disks forms a mirror, and data is striped across the pairs: disk 2 mirrors disk 1, disk 4 mirrors disk 3, and so on.
    • RAID 01 (mirror of stripes). The array is divided into two halves, each a stripe; the second half mirrors the first: disks 1 and 2 hold the data, disks 3 and 4 hold copies of disks 1 and 2, and so on.
  • Mirroring is good for reliability, but BAD for capacity (halved) and for write performance (every logical write becomes a physical write on each mirror); reads can be served by either copy.

RAID Level 4: Saving Space With Parity

  • Parity is a redundancy technique that saves space: instead of a full copy, each stripe stores one parity block computed (via XOR) over the data blocks on the other disks.
  • The invariant the RAID must maintain: the number of 1s in any row, including the parity bit, must be even; only then is parity correct.
  • When one disk fails, the parity block and the surviving data blocks can be used to reconstruct the missing data.
  • Parity is good for capacity and reliability, but BAD for performance: parity must be recomputed and rewritten on every update, and in RAID 4 all parity lives on one disk, which becomes a bottleneck.
  • Parity tolerates one disk failure; if more than one disk fails, the lost data can no longer be reconstructed.
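The XOR reconstruction can be sketched directly (the blocks here are tiny illustrative byte strings standing in for whole disk blocks):

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together; the basis of RAID parity."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"\x0b\x01", b"\x02\x02", b"\x07\x04"]   # blocks on three data disks
parity = xor_blocks(data)                         # block on the parity disk

# Disk 1 dies: rebuild its block by XOR-ing the survivors with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

This works because XOR-ing any value with itself cancels out, so XOR-ing all surviving blocks with the parity leaves exactly the missing block.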

RAID Level 5: Rotating Parity

  • Usual Parity; however, the parity bit is not stored on a single disk but rotated across all disks.
  • This rotation spreads the parity-update load across all disks, removing the single parity disk bottleneck of RAID 4.
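One common way to rotate parity can be sketched as a simple mapping (exact layouts vary by implementation; this is an assumed textbook-style rotation):

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Which disk holds the parity block for a given stripe (RAID-5 style)."""
    return (num_disks - 1 - stripe) % num_disks

# On a 5-disk array, parity walks across the disks one stripe at a time,
# so every disk carries an equal share of the parity-update traffic.
print([parity_disk(s, 5) for s in range(5)])  # [4, 3, 2, 1, 0]
```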