2. Big Data Tools, Techniques, and Systems¶
Hadoop - History or Evolution 1¶
- Hadoop is an open-source framework, overseen by the Apache Software Foundation and written in Java, for storing and processing huge datasets on clusters of commodity hardware.
- Hadoop has two main components: the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
- Hadoop was started by Doug Cutting and Mike Cafarella in the year 2002 when they both started to work on the Apache Nutch project.
- In 2003, they came across a paper published by Google describing the architecture of its distributed file system, GFS (Google File System), for storing large datasets.
- In 2004, Google published one more paper on the technique MapReduce, which was the solution for processing those large datasets.
- In December of 2011, Apache Software Foundation released Apache Hadoop version 1.0.
- Apache Hadoop version 3.0 was released in December 2017.
Introduction to Apache Spark 2¶
- Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.
- It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.
- Apache Spark started in 2009 as a research project at UC Berkeley’s AMPLab, a collaboration involving students, researchers, and faculty, focused on data-intensive application domains.
- A challenge with MapReduce is the sequential multi-step process it takes to run a job. At each step, MapReduce reads data from the cluster, performs operations, and writes the results back to HDFS. Because each step requires a disk read and write, MapReduce jobs are slower due to the latency of disk I/O.
- Spark was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations. With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution.
- In a typical Hadoop implementation, different execution engines such as Spark, Tez, and Presto are also deployed.
- Spark does not have its own storage system; it runs analytics on other storage systems such as HDFS, or on other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others.
- Spark on Hadoop leverages YARN to share a common cluster and dataset with other Hadoop engines, ensuring consistent levels of service and response.
- The Spark framework includes the following components (a brief PySpark sketch follows this list):
- Spark Core is the foundation for the platform. It is responsible for memory management, fault recovery, scheduling, distributing and monitoring jobs, and interacting with storage systems.
- Spark SQL for interactive queries. It supports various data sources out-of-the-box including JDBC, ODBC, JSON, HDFS, Hive, ORC, and Parquet. Other popular stores—Amazon Redshift, Amazon S3, Couchbase, Cassandra, MongoDB, Salesforce.com, Elasticsearch, and many others can be found in the Spark Packages ecosystem.
- Spark Streaming for real-time analytics.
- Spark MLlib for machine learning.
- Spark GraphX for graph processing.
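As a quick illustration of how the Spark Core and Spark SQL components above fit together, here is a minimal PySpark sketch. It assumes a local Spark installation; the app name, data, and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

# Spark Core: create a session, the entry point that schedules and monitors jobs.
spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

# Build a tiny DataFrame in memory (a real job would read from HDFS, S3, etc.).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"]
)

# Cache the data in memory so repeated queries avoid re-reading from storage.
df.cache()

# Spark SQL: register a temporary view and run an interactive query against it.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```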
Hadoop Illuminated 3¶
- Hadoop is built to run on a cluster of machines (nodes).
- Hadoop clusters scale horizontally: capacity is added by adding more nodes to the cluster, so there is no need to upgrade to more powerful machines.
- Hadoop can handle unstructured / semi-structured data because it does not force a schema on the data it stores.
- Hadoop clusters combine storage and computing: traditional architectures keep storage and compute separate, so data must first be moved to the compute nodes before it can be processed; a Hadoop cluster provides storage and computing at the same time, so the computation runs where the data already resides.
- Hadoop = HDFS + MapReduce.
7.3 Hadoop components¶
- HDFS: Hadoop Distributed File System: storage.
- It can store very large amounts of data by scaling horizontally.
- HDFS is designed to work with commodity hardware (cheap, off-the-shelf hardware).
- Fault-tolerant: HDFS replicates data across multiple nodes and can recover from node failures.
- The HDFS implementation is modeled after GFS, the Google File System.
- MapReduce: computation engine.
- It takes care of distributed computing.
- It reads data from HDFS in an optimal way, but it can also read data from other sources (local file systems, databases, web APIs, etc.).
- It divides the computations between different computers (servers, or nodes).
- It is also fault-tolerant: If some of your nodes fail, Hadoop knows how to continue with the computation, by re-assigning the incomplete work to another node and cleaning up after the node that could not complete its task.
- It also knows how to combine the results of the computation in one place.
- HBase: the database for Hadoop.
- It is a key-value database, not a relational database.
- Key-value databases are considered a better fit for Big Data. Why? Because they don’t store nulls!
- HBase is a near-clone of Google’s Big Table.
- ZooKeeper: configuration management.
- ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Hive: data warehouse.
- Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad-hoc querying, and analysis of large datasets stored in Hadoop files.
- Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data (a query sketch follows this list).
- Pig: data flow language.
- Pig is a high-level platform for creating programs that run on Hadoop.
- The language for this platform is called Pig Latin.
- Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high-level, similar to SQL for databases.
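To give a feel for Hive's QL, here is a minimal sketch of querying Hive from Python. It assumes the third-party PyHive package and a HiveServer2 instance reachable on localhost:10000; the table and column names are hypothetical.

```python
from pyhive import hive

# Connect to HiveServer2 (assumed to be running locally on the default port).
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like ordinary SQL but is compiled into jobs that run on Hadoop.
cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
for page, hits in cursor.fetchall():
    print(page, hits)

cursor.close()
conn.close()
```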
7.4. Hadoop alternatives¶
- Large data storage alternatives (to HDFS):
- CEPH: a distributed storage system that distributes file metadata across multiple nodes, whereas HDFS keeps file information (name, size, location, etc.) in the memory of a single NameNode.
- ZFS: By Sun (now Oracle), it is a file system with built-in replication, low overhead, and built-in indexing for searching.
- GFS: Google File System, the inspiration for HDFS.
- Large database alternatives (to HBase):
- Cassandra: the closest to HBase; it is a “Big Table/Dynamo hybrid”.
- HyperTable: a clone of Big Table. It claims to be up to 10 times faster than HBase.
- MongoDB: a document-oriented database written in C++.
- Vertica: a column-oriented database that supports SQL.
- CloudTran: a transaction manager for cloud databases.
- Spire: a distributed database.
- Large data processing alternatives (to MapReduce):
- JavaSpaces: it is a giant hash map container.
- GridGain: a distributed computing platform.
- Terracotta: a distributed cache.
8. Hadoop Distributed File System (HDFS)¶
- Problem 1: data is too big to fit on a single machine. HDFS solution: stores data on multiple machines that look like a unified file system to the user.
- Problem 2: high-end hardware is expensive. HDFS solution: uses commodity hardware.
- Problem 3: commodity hardware fails more often. HDFS solution: the software (Hadoop) handles hardware failures and recovers properly by taking corrective actions.
- Problem 4: hardware failures may lead to data loss. HDFS solution: data is replicated across multiple machines (default is 3 times).
- Problem 5: distributed node coordination is hard. HDFS solution: a ‘daemon’ runs on each machine to manage storage for that machine, and these daemons talk to each other to exchange data. There is also a master NameNode that keeps track of all files and their locations across the cluster.
8.1. HDFS Architecture¶
- In an HDFS cluster, there is ONE master node and many worker nodes. The master node is called the Name Node (NN) and the workers are called Data Nodes (DN). Data nodes actually store the data. They are the workhorses.
- HDFS is resilient and runs on commodity hardware where data is replicated across multiple nodes.
- HDFS is better suited for large files (1 GB or more) and write-once-read-many (WORM) files:
- Disk seek time is high, and with large files the reader seeks less often relative to the amount of data transferred, which improves performance.
- HDFS files are write-once only, no updates are allowed; however, appends are possible.
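To make these properties concrete, the sketch below drives the standard `hdfs dfs` command-line client from Python. It assumes a configured Hadoop client on the PATH and a running cluster; the paths and file names are purely illustrative.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' sub-command and fail loudly if it returns an error."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/data/logs")                          # create a directory
hdfs("-put", "access.log", "/data/logs/")                   # upload a local file
hdfs("-setrep", "3", "/data/logs/access.log")               # replicate it 3 times
hdfs("-appendToFile", "more.log", "/data/logs/access.log")  # appends are allowed
hdfs("-cat", "/data/logs/access.log")                       # read it back (WORM)
```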
9. Introduction To MapReduce¶
- Map: The Map phase is the first phase of processing in MapReduce. It reads data from the input source, processes that data, and generates a set of key-value pairs. The Map phase is parallelized across multiple nodes in the cluster.
- Shuffle and Sort: The Shuffle and Sort phase is the second phase of processing in MapReduce. It takes the output of the Map phase and sorts the data by key. The Shuffle and Sort phase is parallelized across multiple nodes in the cluster.
- Reduce: The Reduce phase is the final phase of processing in MapReduce. It reads the output of the Shuffle and Sort phase, processes that data, and generates a set of key-value pairs. The Reduce phase is parallelized across multiple nodes in the cluster.
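To make the three phases concrete, here is a small, self-contained Python sketch that simulates a word count job in a single process. A real job would run the same map and reduce logic distributed across the cluster; the input documents here are made up.

```python
from itertools import groupby
from operator import itemgetter

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map phase: each input record is turned into (key, value) pairs, here (word, 1).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and Sort phase: pairs are sorted and grouped by key so that all values
# for the same word end up together.
mapped.sort(key=itemgetter(0))

# Reduce phase: the values for each key are combined into the final output.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}
```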
9.2. How MapReduce does multi-node processing¶
- MapReduce has a master and workers, but it is not all push or pull, rather, the work is a collaborative effort between them.
- The master assigns a work portion to the next available worker; thus, no work portion is forgotten or left unfinished.
- Workers send periodic heartbeats to the master. If a worker is silent for some time (usually 10 minutes), the master presumes it crashed and assigns its work to another worker. The master also cleans up the unfinished portion of the crashed worker.
- All of the data resides in HDFS, which avoids the central server concept, with its limitations on concurrent access and size. MapReduce never updates data, rather, it writes new output instead. This is one of the features of functional programming, and it avoids update lockups.
- MapReduce is network and rack-aware, and it optimizes the network traffic.
9.3. How MapReduce really does multi-node processing¶
- Masters and slaves:
- The master and worker machines know each other because their details are listed in a configuration file.
- The master runs a daemon called JobTracker, and the workers run a daemon called TaskTracker.
- The master does not divide all the work beforehand but has an algorithm on how to assign the next portion of the work.
- Thus, no time is spent up front, and the job can begin right away.
- This division of labor, how much to give to the next TaskTracker, is called a “split”, and you have control over it. By default, the input file is split into chunks of about 64 MB in size.
- MapReduce never updates data; rather, it writes new output instead. This avoids workers locking files or competing to read/write the same file. Each Reducer (worker) writes its output to a separate file within the same HDFS directory.
- MapReduce optimizes network traffic:
- Bandwidth is limited, so MapReduce tries to minimize the data sent over the network.
- In the HDFS information, each file is associated with the IP address of the node that stores the file.
- The master tries to assign work to the TaskTracker that is closest to the node that stores the data.
- Thus, it would assign a TaskTracker to work on the files that are stored on the same IP address as the TaskTracker.
- This way, data does not need to be moved over the network, as the work is done on the same node.
11. Hadoop Distributions¶
- Benefits of Hadoop distributions:
- Distributions provide an easy way to install Hadoop, since raw Hadoop is distributed only as tar files.
- Distributions bundle packages that work well together; the ecosystem is vast, and individual components may not be mutually compatible.
- Distributions come with testing and support.
11.2. Overview of Hadoop Distributions¶
- Apache Hadoop: the raw Hadoop.
- Cloudera: the first commercial distribution of Hadoop.
- HortonWorks.
- MapR.
- Intel.
- Pivotal HD.
15. Hadoop Challenges¶
- Hadoop is a new technology, and as with adopting any new technology, finding people who know the technology is difficult!
- Hadoop is designed to solve Big Data problems encountered by Web and social companies. In doing so, a lot of the features enterprises need or want were put on the back burner. For example, HDFS does not offer native support for security and authentication.
- The development and admin tools for Hadoop are still pretty new. Companies like Cloudera, Hortonworks, MapR and Karmasphere have been working on this issue; however, the tooling may not be as mature as enterprises are used to (say, Oracle administration tools).
- Hadoop is not cheap: hardware cost + workforce cost + software cost + training cost + support cost = not cheap.
- Solving problems using Map Reduce requires a different kind of thinking. Engineering teams generally need additional training to take advantage of Hadoop.
- Availability is low:
- Hadoop version 1 had a single point of failure problem because of NameNode. There was only one NameNode for the cluster, and if it went down, the whole Hadoop cluster would be inoperable. This has prevented the use of Hadoop for mission-critical, always-up applications.
MongoDB History 4¶
- MongoDB was developed by the company 10gen in 2007.
- MongoDB is written in C++ (see the first commit in the MongoDB source repository).
- It was meant to store and serve a humongous (extremely large) amount of data required in typical use cases such as content serving.
- MongoDB 1.0 was released in February 2009:
- The initial version focused on providing a usable query language with a document model, indexing and basic support for replication.
- Sharding was added in version 1.6.
- Early MongoDB Design Philosophy:
- Quick and easy data model for faster programming - document model with CRUD.
- Use familiar language and formats - JSON and JavaScript.
- Schemaless documents for flexibility.
- Only essential features in the core database. No joins, transactions, stored procedures, etc.
- Support easy horizontal scaling and durability/availability: replication and sharding.
- Recent MongoDB server versions support joins, and since MongoDB 4.2 even distributed transactions are supported!
- MongoDB is a document-based NoSQL database:
- MongoDB stores data entities in a container called a collection.
- Each piece of data is stored in a JSON-like document format.
- MongoDB also comes with a console client called MongoDB shell.
- MongoDB currently offers drivers for 13 languages including Java, Node.JS, Python, PHP and Swift.
- The MMAPv1 storage engine was removed in version 4.2.
- The encrypted storage engine is only supported in the commercial enterprise server.
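As a small illustration of the document model, here is a sketch using the official PyMongo driver. It assumes a MongoDB server on localhost; the database, collection, and field names are made up.

```python
from pymongo import MongoClient

# Connect to a local MongoDB server.
client = MongoClient("mongodb://localhost:27017")

# Databases and collections are created lazily on first write.
articles = client["demo"]["articles"]

# Documents are schemaless, JSON-like dictionaries; no table definition is needed.
articles.insert_one({"title": "Big Data Tools", "tags": ["hadoop", "spark"], "views": 42})

# Query by any field, including values inside arrays.
doc = articles.find_one({"tags": "spark"})
print(doc["title"], doc["views"])
```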
- Shard Cluster for Horizontal Scalability:
- The minimum recommended number of machines for a fault-tolerant shard cluster is 14!
- Each fault-tolerant replica set in this case handles only a subset of the data.
- This data partitioning is automatically done by the MongoDB engine.
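The sketch below shows, in rough outline, how sharding is switched on for a collection. It is issued through PyMongo against the mongos router of an already deployed shard cluster; the host name, database, and shard key are hypothetical.

```python
from pymongo import MongoClient

# Connect to the mongos query router of a sharded cluster (hypothetical host).
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for a database, then shard one of its collections on a hashed
# key; MongoDB partitions and balances the data across the shards automatically.
client.admin.command("enableSharding", "shop")
client.admin.command("shardCollection", "shop.orders", key={"customer_id": "hashed"})
```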
- The latest version of MongoDB (4.4) includes the following binaries:
- mongo: the MongoDB shell.
- mongod: the MongoDB server.
- mongos: the MongoDB sharding router.
- By 2013, 10gen had over 250 employees and 1000 customers. Realizing the true business potential, 10gen was renamed MongoDB Inc. to focus fully on the database product.
- MongoDB Inc.’s first acquisition was WiredTiger, a company behind the super stable storage engine with the same name.
- MongoDB 3.0 (March 2015) featured the new WiredTiger storage engine, a pluggable storage engine API, an increased replica set member limit of 50, and security improvements. The same year, Glassdoor featured MongoDB Inc. as one of the best places to work.
- In October 2017, MongoDB Inc. went public with over 1 billion dollars in market capitalization.
- In a controversial move, MongoDB Inc. changed the license of the community version from GNU AGPLv3 (AGPL) to the Server Side Public License (SSPL) in October 2018.
MongoDB Today (2020)¶
- MongoDB has been downloaded 110 million times worldwide, according to MongoDB Inc.
- The company currently has 2000+ employees and over 18,000 paying customers, many of whom use both MongoDB Atlas and MongoDB Enterprise.
- The current version of MongoDB community server as of August 2020 is MongoDB 4.4.
- MongoDB Enterprise Server (reportedly priced in the range of $10k per year per server) offers the following additional features:
- In-memory storage engine.
- Auditing.
- Authentication and authorization: LDAP and Kerberos authentication.
- Encryption at rest.
- In addition to the community server, MongoDB Inc. offers the following products:
- MongoDB Atlas: A cloud-based database service.
- MongoDB Atlas Tools: A suite of tools for managing MongoDB Atlas (import/export, mongodump/mongorestore, etc.).
- MongoDB Enterprise Server.
- Atlas Data Lake.
- Atlas Search: A full-text search engine.
- MongoDB Realm: A mobile database.
- MongoDB Charts: A data visualization tool.
- MongoDB Compass: A GUI for MongoDB.
- MongoDB Ops Manager: On-premise management platform for deployment, backup and scaling of MongoDB on custom infrastructure.
- MongoDB Cloud Manager: A cloud version of Ops Manager.
- MongoDB Connectors: a set of drivers.
- The main problem for MongoDB Inc. is that data storage is just one part of the enterprise application landscape. Without a compelling full stack of cloud services, MongoDB Inc. may find it hard to compete with cloud vendors in the future.
Big Data Now 5¶
- Big data practitioners consistently report that 80% of the effort involved in dealing with data is cleaning it up in the first place.
- Predictive modeling use cases:
- Weather forecasting.
- Recommendation engines.
- Predict airline flight times.
- A predictive model on its own only produces predictions; the Drivetrain Approach goes a step further by using those predictions to drive actions.
- Objective-based Data Products: We are entering the era of data as a drivetrain, where we use data not just to generate more data (in the form of predictions); but use data to produce actionable outcomes.
- The four steps in the Drivetrain Approach:
- Define the objective.
- Levers: what inputs can be controlled to achieve the objective.
- Collect data on the levers.
- Model: how the levers affect the objective, built as a Model Assembly Line:
- Modeler.
- Simulator.
- Optimizer.
- Actionable outcomes.
- Using a Drivetrain Approach combined with a Model Assembly Line bridges the gap between predictive models and actionable outcomes.
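The toy Python sketch below (with entirely made-up numbers) walks through the four steps: the objective is profit, the lever is price, a modeler/simulator predicts the outcome of each lever setting, and an optimizer picks the actionable one.

```python
def demand(price):
    """Modeler: a hypothetical demand curve, e.g. fitted from historical sales data."""
    return max(0.0, 1000 - 8 * price)

def profit(price, unit_cost=20):
    """Simulator: combine the model with the lever to predict the objective."""
    return (price - unit_cost) * demand(price)

# Optimizer: search the lever's range and return the actionable outcome (a price).
best_price = max(range(20, 126), key=profit)
print(f"Set the price to {best_price}, expected profit {profit(best_price):.0f}")
```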
Advantages of Hadoop 6¶
- Open source.
- Scalable.
- Fault-Tolerant.
- Schema Independent.
- High Throughput and low latency.
- Data Locality: Hadoop does not move data to the processing unit; it moves the code (processing unit) to where the data is stored.
- Performance.
- Shared-Nothing Architecture: Each node in the cluster is independent and self-sufficient.
- Supports Multiple Languages: Python, Ruby, Perl, Groovy, and Java.
- Cost-Effective.
- Abstractions.
- Compatibility and Support for Various File Systems.
Big Data Analytics on Apache Spark 7¶
- Spark comes with a multistage in-memory programming model compared to the rigid map-then-reduce disk-based model of Hadoop.
Hadoop - Introduction 8¶
Why MongoDB is so important for big data 9¶
- MongoDB is not meant to replace SQL databases; rather, it serves a different use case.
- It is most useful when you have a large amount of unstructured data, or when the relationships between data are not clear.
- Because MongoDB omits the relationships, it can be faster than SQL databases, and finding related data can even be easier: you simply look up the stored reference id instead of joining tables (see the sketch below).
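A minimal PyMongo sketch of this reference-id pattern (assuming a local server; database and field names are illustrative):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["demo"]

# Store the related entity once and keep only its _id in the referencing document.
author_id = db.authors.insert_one({"name": "Ada"}).inserted_id
db.posts.insert_one({"title": "Why NoSQL", "author_id": author_id})

# "Joining" is simply a second lookup by the stored reference id.
post = db.posts.find_one({"title": "Why NoSQL"})
author = db.authors.find_one({"_id": post["author_id"]})
print(post["title"], "by", author["name"])
```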
Video Resources 10 11¶
Apache Spark - Introduction 12¶
Top 10 NoSQL Databases for Data Science 13¶
- MongoDB.
- Apache Cassandra.
- Redis.
- Apache CouchDB.
- Apache HBase.
- Amazon DynamoDB.
- ElasticSearch.
- Oracle NoSQL.
- Azure CosmosDB.
- Couchbase.
History of Hadoop - The Complete Evolution of Hadoop Ecosystem 14¶
Why Use MongoDB and When to Use It? 15¶
References¶
- Aggarwal, A. (2019, January 18). Hadoop – history or evolution. GeeksforGeeks. Retrieved August 7, 2022, from https://www.geeksforgeeks.org/hadoop-history-or-evolution/
- Introduction to Apache Spark. (n.d.). AWS. https://aws.amazon.com/big-data/what-is-spark/
- Kerzner, M., & Maniyam, S. (2016). Hadoop Illuminated. Elephant Scale LLC. https://elephantscale.com/wp-content/uploads/2020/05/hadoop-illuminated.pdf (licensed under CC BY-NC-SA 3.0)
- MongoDB history. (n.d.). Quick Programming Tips. https://www.quickprogrammingtips.com/mongodb/mongodb-history.html
- O’Reilly Radar Team. (2012). Big data now. O’Reilly Media, Inc. http://cdn.oreillystatic.com/oreilly/radarreport/0636920028307/Big_Data_Now_2012_Edition.pdf
- Pedamkar, P. (2022). Advantages of Hadoop. EDUCBA. https://www.educba.com/advantages-of-hadoop/
- Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3), 145-164. https://doi.org/10.1007/s41060-016-0027-9
- Vivek5252. (2021, July 29). Hadoop - Introduction. GeeksforGeeks. Retrieved August 7, 2022, from https://www.geeksforgeeks.org/hadoop-introduction/?ref=lbp
- LearnTek. (2017, April 11). Why MongoDB is so important for big data. https://www.learntek.org/blog/mongodb-important-big-data/
- Simplilearn. (2021, January 21). Hadoop in 5 minutes | What is Hadoop? | Introduction to Hadoop | Hadoop explained [Video]. YouTube. https://www.youtube.com/watch?v=aReuLtY0YMI
- Simplilearn. (2021, March 16). What is MongoDB? | What is MongoDB and how it works | MongoDB tutorial for beginners [Video]. YouTube. https://www.youtube.com/watch?v=SnqPyqRh4r4
- Apache Spark - Introduction. (n.d.). Tutorials Point. https://www.tutorialspoint.com/apache_spark/apache_spark_introduction.htm
- Day, F. (2022, June 17). Top 10 NoSQL databases for data science. Noble Desktop. https://www.nobledesktop.com/classes-near-me/blog/top-nosql-databases-for-data-science
- History of Hadoop - the complete evolution of the Hadoop ecosystem. (2021, August 25). DataFlair. https://data-flair.training/blogs/hadoop-history/
- Why use MongoDB and when to use it? (2022, February 8). MongoDB.com. https://www.mongodb.com/why-use-mongodb