4. Querying Big Data¶
Introduction 1¶
- More Info:
- RDD Programming Guide - Spark 3.5.1 Documentation. (2024). Apache.org. https://spark.apache.org/docs/latest/rdd-programming-guide.html#using-the-shell
- Spark SQL and DataFrames - Spark 3.5.1 Documentation. (2024). Apache.org. https://spark.apache.org/docs/latest/sql-programming-guide.html
- PySpark Overview — PySpark master documentation. (2024). Apache.org. https://spark.apache.org/docs/latest/api/python/index.html#pyspark.sql.DataFrame
- Cluster Mode Overview - Spark 3.5.1 Documentation. (2024). Apache.org. https://spark.apache.org/docs/latest/cluster-overview.html
Apache Spark: Quick Start 2¶
- Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD).
- As of Spark 2.0, the RDD has been superseded by the Dataset, which is strongly typed like an RDD but offers richer optimizations under the hood.
- Dataset:
- A Dataset is a distributed collection of data.
- Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
- The Dataset API is available only in Scala and Java.
- Because of Python’s dynamic nature, many of the benefits of the Dataset API are already available without a dedicated typed API.
- DataFrame:
- A DataFrame is a Dataset organized into named columns.
- It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
- DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs (see the sketch after this list).
- DataFrames are available in 4 languages: Scala, Java, Python, and R.
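As a minimal sketch of the ideas above (assuming a local PySpark installation; the app name, column names, and sample rows are illustrative), a DataFrame with named columns can be built from a plain Python list or from an existing RDD:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session -- illustrative configuration only.
spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# A DataFrame is a Dataset organized into named columns.
rows = [("JFK", "LAX", 2475), ("SEA", "SFO", 679)]
df = spark.createDataFrame(rows, ["origin", "destination", "distance"])

# The same structure can also be derived from an existing RDD.
rdd = spark.sparkContext.parallelize(rows)
df_from_rdd = rdd.toDF(["origin", "destination", "distance"])

df.show()
```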
Spark SQL and DataFrames: Introduction to Built-in Data Sources 3¶
- Any SQL query can be expressed with an equivalent DataFrame API query. For example, the SQL query below can be expressed in the Python DataFrame API as follows:
# SQL query
spark.sql("""SELECT distance, origin, destination
             FROM us_delay_flights_tbl
             WHERE distance > 1000
             ORDER BY distance DESC""").show(10)

# Equivalent DataFrame API query
from pyspark.sql.functions import col, desc

(df.select("distance", "origin", "destination")
   .where(col("distance") > 1000)
   .orderBy(desc("distance"))
   .show(10))
- Instead of maintaining a separate metastore for Spark tables, Spark by default uses the Apache Hive metastore, located at /user/hive/warehouse, to persist all metadata about your tables.
- Spark allows you to create two types of tables, managed and unmanaged (see the sketch after this list):
- Managed tables:
- Spark manages the lifecycle of the table, including the data and metadata.
- Data can be stored in a local file system, HDFS, or an object store such as Amazon S3 or Azure Blob Storage.
- Spark can create, modify, and delete both the metadata and the data.
- Unmanaged tables:
- Spark manages the metadata, but the data is stored on an external data source.
- External data sources can include Cassandra.
- Spark can manage and modify the metadata, but not the data itself.
- Because Spark does not own the data, it can read it but cannot modify or delete it.
- Whether managed or unmanaged, tables reside within a database; by default, Spark creates tables under the default database.
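A minimal sketch of the managed/unmanaged distinction above, modeled on the us_delay_flights example used earlier (the database name, table names, file path, and abbreviated schema are assumptions for illustration):

```python
# Managed table: Spark owns both the metadata and the data,
# so DROP TABLE removes the underlying data as well.
spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("""CREATE TABLE IF NOT EXISTS managed_us_delay_flights_tbl
             (date STRING, delay INT, distance INT, origin STRING, destination STRING)""")

# Unmanaged (external) table: Spark tracks only the metadata; the data stays
# at the external location and survives DROP TABLE.
spark.sql("""CREATE TABLE IF NOT EXISTS us_delay_flights_tbl
             (date STRING, delay INT, distance INT, origin STRING, destination STRING)
             USING csv
             OPTIONS (PATH '/tmp/flights/departuredelays.csv')""")
```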
Spark SQL Tutorial 4¶
- Spark supports querying data either via SQL or via the Hive Query Language.
- Spark SQL integrates relational processing with Spark’s functional programming.
- Spark SQL was built to overcome the drawbacks of Apache Hive (listed below) and to replace it.
- Spark SQL is faster than Hive when it comes to processing speed.
- Limitations With Hive:
- Hive launches multiple MapReduce jobs to execute a single query. MapReduce jobs are slow and lag in performance.
- Hive has no resume capability. If a job fails, it has to be restarted from the beginning.
- Hive cannot drop encrypted databases in cascade mode when the trash is enabled; the purge option, which skips the trash, must be used instead.
- Spark SQL is not a database but a module for structured data processing. It works primarily with DataFrames, which serve as its programming abstraction, and it acts as a distributed SQL query engine.
- Spark SQL makes Apache Spark accessible to more users and improves optimization for existing ones.
- Spark SQL provides DataFrame APIs that perform relational operations on both external data sources and Spark’s built-in distributed collections.
- It introduces an extensible optimizer called Catalyst, which helps support a wide range of data sources and algorithms for big data.
- Spark SQL Libraries:
- Data Source API: It is a pluggable API that allows Spark to access structured data through Spark SQL.
- Dataframe API: It is a distributed collection of data organized into named columns.
- SQL Interpreter and Optimizer: the component that parses, plans, and optimizes SQL queries (Catalyst is the optimizer).
- SQL Service: the entry point for working with structured data in Spark; it is used to create DataFrames and execute SQL queries.
- Features Of Spark SQL:
- Integration with Spark.
- Uniform Data Access: a single interface to query a variety of sources, such as Hive, Avro, Parquet, ORC, JSON, and JDBC.
- Compatibility with Hive.
- Standard Connectivity: JDBC/ODBC.
- Performance and Scalability.
- User-Defined Functions (UDFs).
- RDDs:
- Resilient Distributed Datasets (RDDs) are distributed memory abstractions that let programmers perform in-memory computations on large clusters in a fault-tolerant manner.
- RDDs can be created from almost any data source, e.g., a Scala collection, the local file system, Hadoop HDFS, Amazon S3, an HBase table, etc.
- RDDs support two types of operations (illustrated in the sketch after this list):
- Transformations:
- Transformations are operations that are applied to RDDs to create a new RDD.
- Examples: map, filter, join, union, reduceByKey, groupByKey, etc.
- Transformations are lazy, meaning they do not compute their results right away; they only compute their results when an action is called.
- Actions:
- Actions are operations that are applied on RDDs to instruct Spark to perform computation and send the result back to the driver.
- Results are usually a value (or a file, but not another RDD).
- Examples: count, collect, reduce, and saveAsTextFile, etc.
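A minimal PySpark sketch of the lazy-transformation / action distinction described above (a running SparkSession named spark is assumed; the sample data is illustrative):

```python
# Transformations build a lineage of new RDDs but trigger no computation yet.
rdd = spark.sparkContext.parallelize(["spark", "hive", "spark", "storm"])
pairs = rdd.map(lambda word: (word, 1))          # transformation (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)   # transformation (lazy)

# Actions force the computation and return results to the driver.
print(counts.collect())   # e.g. [('spark', 2), ('hive', 1), ('storm', 1)] -- order may vary
print(counts.count())     # 3
```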
The Real-Time Big Data Processing Framework: Advantages and Limitations 5¶
- Processing frameworks perform computing over the data in the system, either by reading from non-volatile storage or as it is ingested into the system.
- Computing over data is the process of extracting information and insight from large quantities of individual data points.
- Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine.
- Engines and frameworks can often be swapped out or used in tandem.
- Apache Spark is another processing framework that can be hooked into Hadoop, replacing the MapReduce engine.
- This interoperability between components is one reason that big data systems have great flexibility.
Batch Processing Systems¶
- Batch processing:
- It involves operating over a large, static dataset and returning the result at a later time when the computation is complete.
- It defines processes that run on the entire dataset.
- It has been the traditional way of processing data.
- It is well-suited for calculations where access to a complete set of records is required.
- Example: when calculating totals and averages, the dataset must be treated holistically instead of as a collection of individual records.
- It is not well-suited to workloads where processing time is a significant concern.
- It is usually the choice for mining historical data for insights.
- Datasets ideal for batch processing:
- Bounded: finite in size.
- Persistent: stored in a non-volatile storage system.
- Large: too large to fit in memory.
- Apache Hadoop:
- It is the default batch processing system.
- Batch Processing Model:
- MapReduce’s processing technique follows the map, shuffle, reduce algorithm using key-value pairs (a word-count sketch follows this list).
- Procedure:
- Read the dataset from HDFS.
- Divide the dataset into smaller chunks and distribute them across the cluster nodes.
- Apply the computation to each chunk on its node (the map function); intermediate results are written back to disk.
- Shuffle and sort the intermediate results.
- Reduce the intermediate results to a final result by combining them according to the key.
- Write the final result back to HDFS.
- Advantages:
- Relying on disk storage allows for the processing of datasets that are too large to fit in memory.
- Disk is usually cheaper than memory, which makes it more cost-effective.
- Scalability: adding more nodes to the cluster increases the processing power.
- Strong community support and mature tooling, since it has been around for a long time.
- Disadvantages:
- Using the disk increases the processing time.
- MapReduce concepts are hard to learn and use.
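A single-machine Python sketch of the map → shuffle → reduce flow described above, using word count as the computation (real MapReduce distributes these phases across HDFS and the cluster nodes; the input documents are illustrative):

```python
from collections import defaultdict

documents = ["big data batch processing", "batch processing with hadoop"]

# Map: emit (key, value) pairs from each input chunk.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group all intermediate values by key
# (done across nodes in a real MapReduce job).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine the values for each key into the final result.
reduced = {key: sum(values) for key, values in sorted(groups.items())}
print(reduced)   # {'batch': 2, 'big': 1, 'data': 1, 'hadoop': 1, 'processing': 2, 'with': 1}
```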
Stream Processing Systems¶
- Stream processing:
- It does the computation on each record as it enters the system.
- It defines processes that run on individual records.
- Its processing is event-based and does not “end” until explicitly stopped.
- Results are immediately available and are continually updated as new records arrive.
- It is well-suited for functional processing where the computation is applied to each record individually and is almost always stateless.
- It is well-suited for near real-time processing.
- Analytics, server or application error logging, and other time-based metrics are a natural fit because reacting to changes in these areas can be critical to business functions.
- It is a good fit for data where you must respond to changes or spikes and where you’re interested in trends over time.
- Datasets ideal for stream processing:
- Unbounded: infinite:
- The total dataset is all of the records that have entered the system so far.
- The working dataset is limited to the records being processed at any one time (contrasted in the sketch after this list):
- 1 record at a time: true stream processing.
- A few records at a time: micro-batch processing.
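A framework-free Python sketch contrasting true per-record stream processing with micro-batching over the same unbounded source (the event generator and batch size are illustrative):

```python
import itertools

def events():
    """An unbounded source: yields one event at a time, forever."""
    for i in itertools.count():
        yield {"id": i, "value": i % 5}

# True stream processing: handle each record individually as it arrives.
for event in itertools.islice(events(), 5):
    print("processed record", event["id"])

# Micro-batch processing: buffer a few records, then process them together.
def micro_batches(source, batch_size=3):
    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return
        yield batch

for batch in itertools.islice(micro_batches(events()), 2):
    print("batch total:", sum(event["value"] for event in batch))
```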
Apache Storm¶
- It is a stream processing framework that focuses on extremely low latency.
- It is the best option for workloads that require near real-time processing.
- It can handle very large quantities of data and deliver results with less latency than other solutions.
- Stream processing in Storm works by orchestrating DAGs (Directed Acyclic Graphs), which it calls topologies.
- Processing model (a conceptual sketch follows this list):
- Streams: the data that flows through the topology; unbounded data that are continuously arriving at the system.
- Spouts: the source of the data; they read data from an external source and emit it into the topology.
- Bolts: the processing units; receive data from spouts or other bolts, process it, and emit it to other bolts or sinks.
- By default, Storm offers an at-least-once processing guarantee, meaning that each message is processed at least once, but there may be duplicates in some failure scenarios.
- Storm does not guarantee that messages will be processed in order.
- In order to achieve exactly-once, stateful processing, an abstraction called Trident is also available.
- To be explicit, Apache Storm without Trident is often referred to as Core Storm.
- Trident significantly alters the processing dynamics of Storm, increasing latency, adding a state to the processing, and implementing a micro-batching model instead of an item-by-item pure streaming system.
- Storm Advantages:
- The best solution for near real-time processing.
- Extremely low latency.
- It is useful when processing time is important, such as when the result is sent to the user immediately.
- Flexibility: Storm with Trident gives you the option to use micro-batches instead of pure stream processing.
- Storm is compatible with Hadoop, which allows you to use both systems together.
- It has wide language support.
- Storm Disadvantages:
- It cannot guarantee that messages will be processed in order.
- At-least-once processing can lead to duplicates.
- Exactly-once processing is available (via Trident) but at the cost of increased latency.
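Storm itself is programmed against a JVM API; the following is only a conceptual Python sketch of the stream → spout → bolt topology described above (all names are illustrative and not part of Storm’s API):

```python
def sentence_spout():
    """Spout: reads from an external source and emits tuples into the topology."""
    for line in ["storm processes streams", "streams of tuples flow through bolts"]:
        yield line

def split_bolt(stream):
    """Bolt: receives tuples, processes them, and emits new tuples downstream."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Terminal bolt: keeps a running count per word (a simple sink)."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt.
print(count_bolt(split_bolt(sentence_spout())))
```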
Apache Samza¶
- It is a stream processing framework that is tightly tied to the Apache Kafka messaging system.
- It integrates well with Kafka, thus taking advantage of its unique features and guarantees.
- It uses Kafka to provide fault tolerance, buffering, and state storage.
- Samza relies on YARN for resource negotiation; it can therefore run on any Hadoop cluster with YARN, but this also means a Hadoop cluster is required to run Samza.
- Because Kafka is represented by an immutable log, Samza deals with immutable streams.
- Processing model (a Kafka producer/consumer sketch follows this list):
- Topics: the streams of data entering a Kafka system, where consumers can subscribe to them.
- Partitions:
- The way Kafka distributes data across the cluster.
- Each topic is divided into partitions.
- It is guaranteed that messages with the same key will always go to the same partition.
- Partitions have guaranteed order.
- Brokers: the individual nodes in the Kafka cluster.
- Producers: components that write to a Kafka topic; the producer provides the key for partitioning.
- Consumers: components that read from a Kafka topic.
- Samza Advantages:
- Tight integration with Kafka ensures guarantees that are not available in other systems.
- Kafka’s immutable log allows for fault tolerance and state storage.
- Low latency.
- Kafka’s multi-subscriber model allows for multiple consumers to read from the same topic.
- Intermediate results are sent to Kafka and thus can be consumed independently by downstream systems.
- The loose coupling between processing stages, connected through Kafka topics, allows for easy scaling and multiple consumers without stressing the system.
- Samza disadvantages:
- It does not provide accurate recovery of aggregated state (like counts) in the event of a failure since data might be delivered more than once.
- It only supports JVM languages (Java, Scala).
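A hedged sketch of the Kafka concepts above using the third-party kafka-python package (assumes a broker at localhost:9092; the topic name, key, and payloads are illustrative):

```python
from kafka import KafkaConsumer, KafkaProducer  # third-party: kafka-python

# Producer: the key determines the partition, so records sharing a key
# (e.g. the same user) land on the same partition and keep their order.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("page-views", key=b"user-42", value=f"click-{i}".encode())
producer.flush()

# Consumer: subscribes to the topic and reads records in partition order.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.partition, record.key, record.value)
```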
Hybrid Processing Systems¶
- Some processing frameworks can handle both batch and stream workloads.
Apache Spark¶
- It is a next-generation batch-processing framework with stream-processing capabilities.
- Built using many of the same principles of Hadoop’s MapReduce engine, Spark focuses primarily on speeding up batch processing workloads by offering full in-memory computation and processing optimization.
- Spark can be deployed as a standalone cluster (if paired with a capable storage layer) or can hook into Hadoop as an alternative to the MapReduce engine.
- Batch Processing Model:
- Spark processes all data in memory, only interacting with the storage layer to initially load the data into memory and at the end to persist the final results.
- All intermediate results are managed in memory.
- Spark is also faster on disk-related tasks because of holistic optimization that can be achieved by analyzing the complete set of tasks ahead of time.
- Spark uses a model called Resilient Distributed Datasets, or RDDs, to work with data. These are immutable structures that exist within memory that represent collections of data.
- Stream Processing Model:
- Stream processing capabilities are supplied by Spark Streaming.
- Spark itself is designed with batch-oriented workloads in mind.
- It uses a micro-batching model to simulate stream processing (see the sketch after this list).
- Spark Streaming works by buffering the stream in sub-second increments.
- These are sent as small fixed datasets for batch processing.
- Spark Advantages:
- It is significantly faster than Hadoop MapReduce, especially for in-memory workloads.
- It can be deployed as a standalone cluster or integrated with an existing Hadoop cluster.
- It can handle both batch and stream processing.
- It has an ecosystem of libraries that can be used for machine learning, interactive queries, etc.
- Easy to use and learn compared to MapReduce.
- Spark Disadvantages:
- Buffering can increase latency.
- RAM is more expensive than disk storage, thus increasing costs.
- It uses more resources than Hadoop.
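A minimal PySpark Streaming sketch of the micro-batching model described above (assumes a text source on localhost:9999, e.g. started with `nc -lk 9999`; the legacy DStream API is used here because it exposes the batch interval directly):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "micro-batch-sketch")
ssc = StreamingContext(sc, batchDuration=1)  # buffer the stream into 1-second micro-batches

# Each micro-batch of lines is handed to Spark as a small, fixed RDD.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```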
Apache Flink¶
- It is a stream processing framework that can also handle batch tasks.
- It considers batches to simply be data streams with finite boundaries and thus treats batch processing as a subset of stream processing.
- This stream-first approach to all processing has a number of interesting side effects.
- This stream-first approach has been called the Kappa architecture, in contrast to the more widely known Lambda architecture.
- Stream processing model:
- Flink’s stream processing model handles incoming data on an item-by-item basis as a true stream.
- Streams: immutable, unbounded datasets that flow through the system.
- Operators: the processing units that receive data from streams, process it, and emit it to other operators or sinks.
- Sources: the entry points for data into the system.
- Sinks: the exit points for data out of the system.
- Stream processing tasks take snapshots at set points during their computation to use for recovery in case of problems.
- Flink’s stream processing understands the concept of event time, meaning the time that the event occurred, and can handle sessions.
- This means that it can guarantee ordering and grouping in some interesting ways.
- Batch processing model:
- Flink treats batch processing as a subset of stream processing.
- It uses the same operators and concepts as stream processing.
- It treats a batch as a bounded stream (see the sketch after this list).
- It can handle both batch and stream processing in the same system.
- Flink Advantages:
- Flink’s stream-first approach offers low latency, high throughput, and true entry-by-entry processing.
- Its true record-at-a-time stream processing makes it fairly unique among processing frameworks.
- It has its own garbage collection system that allows for better memory management.
- It can automatically perform some optimizations that must be done manually in Spark.
- It handles data partitioning and caching automatically.
- Flink offers a web-based scheduling view to easily manage tasks and view the system.
- Flink Disadvantages:
- It is still a very young project.
- Large-scale deployments in the wild are still not as common as other processing frameworks.
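Flink’s own APIs are JVM-first (with PyFlink as a Python layer); the framework-free Python sketch below only illustrates the “batch is a bounded stream” idea above by running the same operator pipeline over a bounded and an unbounded source (all names are illustrative):

```python
import itertools

def pipeline(stream):
    """The same operators apply to any stream, bounded or unbounded."""
    squared = (x * x for x in stream)            # operator 1
    evens = (x for x in squared if x % 2 == 0)   # operator 2
    return evens

# Batch processing: a bounded stream (finite source), consumed to completion.
print(list(pipeline(range(10))))   # [0, 4, 16, 36, 64]

# Stream processing: an unbounded source, consumed record by record as data arrives.
for value in itertools.islice(pipeline(itertools.count()), 5):
    print(value)                   # 0, 4, 16, 36, 64
```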
High-Efficient Fuzzy Querying with HiveQL for Big Data Warehousing 6¶
- Apache Hive is a data warehousing framework working on top of the Hadoop platform for big data processing.
- To overcome the challenges of big data, changes have to be made that involve moving:
- From fixed-schema data warehouses to schema-less data lakes for storing data.
- From extract, transform, and load to extract, load, and transform procedures for data capturing and preparation.
- From schema on write to schema on read data analysis strategies.
- From scaling up to scaling out solutions for increasing the performance of analytic queries.
- Hive queries and analyzes big data stored in HDFS using an SQL-like query language called HiveQL.
- FuzzyHive:
- It is a library that resides on top of Hive and brings the benefits of fuzzy logic to data processing, querying, and analysis.
- It supports elastic grouping and filtering on dimensional attributes.
- It supports fuzzy joining on attributes shared by many tables of a big data warehouse.
- It supports flexible ad hoc querying capabilities against the dynamically defined schema of a data warehouse built over a big data lake.
Video Resources 7 8 9¶
Apache Hive¶
- Hive allows you to write SQL-like queries that are converted into jobs for an execution engine (MapReduce by default, or Spark, Tez, etc.) and executed on the Hadoop cluster (see the connection sketch after this list).
- Hive is not a database but a data warehousing framework; it cannot be used for online transaction processing (OLTP) or real-time processing.
- Hive supports many file formats, including text, sequence, ORC, and Parquet.
- Hive stores metadata in a relational database, such as MySQL, PostgreSQL, or Derby.
- Hive provides good compression and indexing capabilities.
- Supports user-defined functions (UDFs) and user-defined aggregates (UDAFs).
- Supports specialized joins.
- Hive vs RDBMS:
- Hive enforces schema on read, while an RDBMS enforces schema on write.
- Hive scales to hundreds of petabytes of storage; an RDBMS typically scales to tens of terabytes.
- Hive does not support OLTP; an RDBMS does.
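A hedged sketch of running a HiveQL query from Python using the third-party PyHive package (assumes a HiveServer2 instance on localhost:10000 and reuses the illustrative us_delay_flights_tbl table name from earlier):

```python
from pyhive import hive  # third-party: PyHive

# Connect to HiveServer2; Hive compiles the query into jobs for the
# configured execution engine (MapReduce, Tez, or Spark) on the cluster.
conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT origin, COUNT(*) AS flights "
               "FROM us_delay_flights_tbl GROUP BY origin")
for origin, flights in cursor.fetchall():
    print(origin, flights)
cursor.close()
conn.close()
```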
References¶
1. Apache.org (2022). Built-in SQL Commands. Retrieved August 25, 2022, from https://spark.apache.org/docs/latest/api/sql/index.html ↩
2. Quick Start. (2022). Apache Spark 3.3.0. https://spark.apache.org/docs/latest/quick-start.html ↩
3. Damji, J. S., Wenig, B., Das, T., & Lee, D. (2020). Learning Spark (2nd ed.). O’Reilly Media, Inc. https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch04.html ↩
4. Dayananda, S. (2022, March 28). Spark SQL tutorial – Understanding Spark SQL with examples. Edureka. https://www.edureka.co/blog/spark-sql-tutorial/ ↩
5. Gurusamy, V., Kannan, S., & Nandhini, K. (2017). The real-time big data processing framework: Advantages and limitations. International Journal of Computer Sciences and Engineering, 5(12), 305-312. https://www.researchgate.net/profile/Vairaprakash-Gurusamy/publication/322550872_The_Real_Time_Big_Data_Processing_Framework_Advantages_and_Limitations/links/5a5f4658458515c03ee1d07c/The-Real-Time-Big-Data-Processing-Framework-Advantages-and-Limitations.pdf ↩
6. Małysiak-Mrozek, B., Wieszok, J., Pedrycz, W., Ding, W., & Mrozek, D. (2022). High-efficient fuzzy querying with HiveQL for big data warehousing. IEEE Transactions on Fuzzy Systems, 30(6), 1823-1837. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9388934 ↩
7. BigDataELearning. (2017, December 17). What is Apache Hive?: Understanding Hive [Video]. YouTube. https://www.youtube.com/watch?v=cMziv1iYt28 ↩
8. Simplilearn. (2017, November 2). Big Data Tools and Technologies | Big Data Tools Tutorial | Big Data Training | Simplilearn [Video]. YouTube. https://youtu.be/Pyo4RWtxsQM?t=103 ↩
9. Computerphile. (2018, December 12). Apache Spark - Computerphile [Video]. YouTube. https://www.youtube.com/watch?v=tDVPcqGpEnM ↩