Skip to content

DA2. Hadoop

Statement

Hadoop has become a popular technology for Big Data and Analytics applications. As part of your response for this unit’s discussion question, describe what Hadoop is and how it functions. Further discuss why Hadoop is such an important analytics technology.

Solution

According to (AWS, n.d.), Apache Hadoop is an open-source framework created in 2006 for storing and processing large datasets using clusters of computers (hardware nodes), where Hadoop is a software that runs on all of the connected nodes and provides a distributed file system that spans across these nodes and split/schedule jobs to utilize the processing power of all the nodes in the cluster.

Hadoop consists of four main modules (Stedman, n.d.):

  • Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to data stored across cluster nodes.
  • Hadoop YARN: a framework for job scheduling and cluster resource management, thus it distributes processes and jobs across the cluster nodes.
  • Hadoop MapReduce: runs a job across the cluster nodes (by the help of YARN) and then combines the results into a single output, that can be stored in HDFS or passed to another processes.

Hadoop is known for its flexibility in accepting structured, semi-structured, and unstructured data, it is ability to scale up to large number of nodes (AWS, n.d), and its fault tolerance, that is, if a node fails, the job will be automatically rescheduled to run on another node while the entire process is still running (Stedman, n.d.). Hadoop is also flexible in running real-time and batch processing jobs, interactive querying, stream processing, and real-time analytics (Stedman, n.d.).

As the functionality that Hadoop provides is very flexible and scalable, it empowers data scientists to perform data mining and machine learning tasks on large datasets,thus, its wide usage in industries like customer analytics, risk management, operational intelligence, supply chain management, and so on (Stedman, n.d.).

Hadoop also suffers from some drawbacks, such its performance issues (writing to disks), high costs (cpu, memory, and storage are not separated in nodes), complexity (it is not easy to setup and maintain) (Stedman, n.d.).

References