WA2. Hadoop¶
Statement¶
- Describe three main components of Hadoop, and summarize at least three main principles behind each of these three components.
- Lastly, discuss the importance of each of the components mentioned for processing big data.
Answer¶
Introduction¶
Hadoop is an open-source framework written in Java for storing and processing data in a distributed manner, scaling horizontally across clusters of commodity hardware. It was started in 2002 by Doug Cutting and Mike Cafarella, and version 1.0 was released in 2011 (Aggarwal, 2019).
Hadoop has four main components: Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), MapReduce, and Hadoop Common Utilities (Kiran, 2019). In this text, we will discuss three of these components (HDFS, YARN, and MapReduce), the principles behind them, and their importance for processing big data.
Principles Behind Hadoop Components¶
Hadoop Distributed File System (HDFS) is the component responsible for storing data in a distributed manner across multiple nodes in a Hadoop cluster. Several principles underpin HDFS; here are a few of them, followed by a short code sketch after the list (Mesika, 2023):
- Horizontal scaling: it is easy to add more storage capacity by adding more nodes to the cluster.
- Fault tolerance: each block of data is replicated across multiple nodes (three by default), so if a node fails, another replica serves the data and nothing is lost.
- High throughput: HDFS divides data into blocks, stores them across multiple nodes, and lets them be read and processed in parallel.
- Write-once files: once a file is written, it cannot be modified in place, only appended to; this keeps data consistent and makes parallel processing safe.
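As a minimal sketch of how these properties look from an application's point of view, the Java snippet below (Java being the language Hadoop itself is written in) creates a file in HDFS, appends to it, and requests three replicas through the standard `FileSystem` API. The path, the data written, and the replication factor are illustrative assumptions, and appends must be enabled on the cluster for the second step to work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Illustrative path; any HDFS path the user is allowed to write to works.
        Path file = new Path("/user/demo/events.log");

        // Write-once: create() opens a brand-new file for writing.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first record\n");
        }

        // Existing bytes cannot be modified in place, but the file can be appended to
        // (assumes appends are enabled on the cluster).
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("second record\n");
        }

        // Fault tolerance: ask HDFS to keep three replicas of every block of this file.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```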
Yet Another Resource Negotiator (YARN) is the component responsible for managing resources and scheduling tasks in a Hadoop cluster. It runs on the master node and communicates with daemons on the worker nodes to manage resource allocation. Here are a few principles behind YARN, followed by a small job-submission sketch (Palaknarula, 2018):
- Compatibility: YARN is designed to work with the many applications, plugins, and frameworks that run on top of it, and it remains backward compatible with jobs written for older versions of MapReduce.
- Resource reservation: users and administrators can configure resource allocation and give certain jobs priority over others to meet constraints and deadlines.
- Federation: a very large cluster can be divided into multiple sub-clusters, each with its own Resource Manager, while still appearing as a single unit to the outside world; this helps manage resources more efficiently at scale.
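To make resource reservation and prioritization concrete, the hedged sketch below shows how a job submitted to YARN through the MapReduce API can request container sizes, a scheduler queue, and a priority via standard configuration properties. The memory and vcore values and the queue name are illustrative assumptions; queues and their capacities are defined by the cluster administrator.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobPriority;

public class YarnResourceRequestSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Ask YARN for containers of a given size (illustrative values).
        conf.setInt("mapreduce.map.memory.mb", 2048);    // memory per map container
        conf.setInt("mapreduce.map.cpu.vcores", 2);      // CPU cores per map container
        conf.setInt("mapreduce.reduce.memory.mb", 4096); // memory per reduce container

        // Submit to a specific scheduler queue (the queue name is an assumption; the
        // administrator defines queues and how much of the cluster each may use).
        conf.set("mapreduce.job.queuename", "analytics");

        Job job = Job.getInstance(conf, "resource-request-demo");

        // Ask the scheduler to run this job ahead of lower-priority jobs in the
        // same queue when resources are scarce.
        job.setPriority(JobPriority.HIGH);

        // ... mapper, reducer, and input/output paths would be configured here ...
    }
}
```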
MapReduce is the component responsible for processing data in a Hadoop cluster. It is a programming model that allows developers to write parallel processing jobs that run on the cluster. Here are a few principles behind MapReduce, followed by the classic word-count example (Kerzner & Maniyam, 2016):
- Data locality: data is processed on the node where it is stored, or on the closest available node, which reduces the amount of data that has to travel across the network.
- Parallel processing: data is divided into chunks and worker nodes process them in parallel, then the results are combined to produce the final output.
- Simplified programming model: the component abstracts the complexity of parallel processing and distributed computing, allowing developers to focus on the business logic of their applications.
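The classic word-count job below is a minimal sketch of this model: the developer writes only a map function and a reduce function, and Hadoop takes care of splitting the input, scheduling map tasks near their data, shuffling intermediate results, and retrying failed tasks. Class names and paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in this node's input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, such a job would typically be launched with `hadoop jar wordcount.jar WordCount <input> <output>`, after which YARN distributes the map and reduce tasks across the cluster.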
Importance of Hadoop Components for Processing Big Data¶
Hadoop Distributed File System (HDFS) is important for processing big data because the volume of big data is too large to be handled by a single machine: HDFS combines the storage capacity of many machines while still presenting them as a single file system. Because it runs on commodity hardware, it lowers cost while storing large amounts of data with fault tolerance, high throughput, and high availability.
Yet Another Resource Negotiator (YARN) is important for processing big data because it manages resources efficiently and schedules tasks across the nodes of a Hadoop cluster. YARN is sometimes called the operating system of Hadoop: it acts like a single operating system for all nodes, so the cluster looks like a single processing unit. Its configurability and support for a wide range of applications and frameworks make it suitable for many use cases, even specialized ones.
MapReduce is important for processing big data because it allows developers to write parallel processing jobs that run on a Hadoop cluster. The data locality and parallel processing principles behind MapReduce make it well suited to processing large amounts of data in a distributed manner, which matters for use cases that require complex data processing and analysis, such as machine learning, data mining, and business intelligence.
Conclusion¶
The text discussed the principles and the importance of three of the main components of Hadoop: HDFS, YARN, and MapReduce. Hadoop is an essential tool in the big data ecosystem and is supported by a huge community that keeps developing plugins and packages to support more use cases.
References¶
- Aggarwal, A. (2019, January 18). Hadoop – history or evolution. GeeksforGeeks. Retrieved August 7, 2022, from https://www.geeksforgeeks.org/hadoop-history-or-evolution/
- Kerzner, M., & Maniyam, S. (2016). Hadoop Illuminated. Elephant Scale LLC. https://elephantscale.com/wp-content/uploads/2020/05/hadoop-illuminated.pdf (licensed under CC BY-NC-SA 3.0)
- Kiran, R. (2019, June 21). Hadoop Components that you need to know about. Edureka. https://www.edureka.co/blog/every-hadoop-component/#hadoop
- Mesika, O. (2023, August 27). Hadoop Architecture: 4 Key Components & Designing Your Cluster. Intel Granulate. https://granulate.io/blog/hadoop-architecture-4-key-components-designing-your-cluster/
- Palaknarula. (2018, December 11). Hadoop YARN Architecture. GeeksforGeeks. https://www.geeksforgeeks.org/hadoop-yarn-architecture/