4. Querying Big Data¶
Introduction 1¶
- More Info:
- RDD Programming Guide - Spark 3.5.1 Documentation. (2024). Apache.org. https://spark.apache.org/docs/latest/rdd-programming-guide.html#using-the-shell
- Spark SQL and DataFrames - Spark 3.5.1 Documentation. (2024). Apache.org. https://spark.apache.org/docs/latest/sql-programming-guide.html
- PySpark Overview — PySpark master documentation. (2024). Apache.org. https://spark.apache.org/docs/latest/api/python/index.html#pyspark.sql.DataFrame
- Cluster Mode Overview - Spark 3.5.1 Documentation. (2024). Apache.org. https://spark.apache.org/docs/latest/cluster-overview.html
Apache Spark: Quick Start 2¶
- Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD).
- As of Spark 2.0, the RDD has been superseded by the Dataset, which is strongly typed like an RDD but offers richer optimizations under the hood.
- Dataset:
- A Dataset is a distributed collection of data.
- Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
- The Dataset API is available only in Scala and Java.
- Because of Python’s dynamic nature, many of the benefits of the Dataset API are already available without a dedicated typed API.
- DataFrame:
- A DataFrame is a Dataset organized into named columns.
- It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
- DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs (see the sketch after this list).
- DataFrames are available in 4 languages: Scala, Java, Python, and R.
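As a minimal sketch of the ideas above (assuming a local PySpark installation; the app name, column names, and sample rows are illustrative), a DataFrame with named columns can be built from a plain Python list or from an existing RDD:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session -- illustrative configuration only.
spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# A DataFrame is a Dataset organized into named columns.
rows = [("JFK", "LAX", 2475), ("SEA", "SFO", 679)]
df = spark.createDataFrame(rows, ["origin", "destination", "distance"])

# The same structure can also be derived from an existing RDD.
rdd = spark.sparkContext.parallelize(rows)
df_from_rdd = rdd.toDF(["origin", "destination", "distance"])

df.show()
```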
Spark SQL and DataFrames: Introduction to Built-in Data Sources 3¶
- Any SQL query can be expressed with an equivalent DataFrame API query. For example, the SQL query below can be expressed in the Python DataFrame API as follows:
# SQL query
spark.sql("""SELECT distance, origin, destination
             FROM us_delay_flights_tbl
             WHERE distance > 1000
             ORDER BY distance DESC""").show(10)

# Equivalent DataFrame API query
from pyspark.sql.functions import col, desc

(df.select("distance", "origin", "destination")
   .where(col("distance") > 1000)
   .orderBy(desc("distance"))
   .show(10))
- Instead of maintaining a separate metastore for Spark tables, Spark by default uses the Apache Hive metastore, located at /user/hive/warehouse, to persist all metadata about your tables.
- Spark allows you to create two types of tables, managed and unmanaged (see the sketch after this list):
- Managed tables:
- Spark manages the lifecycle of the table, including the data and metadata.
- Data can be stored in a local file system, HDFS, or an object store such as Amazon S3 or Azure Blob Storage.
- Spark can create, modify, and delete both the metadata and the data.
- Unmanaged tables:
- Spark manages the metadata, but the data is stored on an external data source.
- External data sources can include Cassandra.
- Spark can manage and modify the metadata, but not the data itself.
- Because Spark does not own the data, it can read it but cannot modify or delete it.
- Whether managed or unmanaged, tables reside within a database; by default, Spark creates tables under the default database.
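A minimal sketch of the managed/unmanaged distinction above, modeled on the us_delay_flights example used earlier (the database name, table names, file path, and abbreviated schema are assumptions for illustration):

```python
# Managed table: Spark owns both the metadata and the data,
# so DROP TABLE removes the underlying data as well.
spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("""CREATE TABLE IF NOT EXISTS managed_us_delay_flights_tbl
             (date STRING, delay INT, distance INT, origin STRING, destination STRING)""")

# Unmanaged (external) table: Spark tracks only the metadata; the data stays
# at the external location and survives DROP TABLE.
spark.sql("""CREATE TABLE IF NOT EXISTS us_delay_flights_tbl
             (date STRING, delay INT, distance INT, origin STRING, destination STRING)
             USING csv
             OPTIONS (PATH '/tmp/flights/departuredelays.csv')""")
```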
Spark SQL Tutorial 4¶
- Spark supports querying data either via SQL or via the Hive Query Language.
- Spark SQL integrates relational processing with Spark’s functional programming.
- Spark SQL was built to overcome the drawbacks of Apache Hive (listed below) and to replace it.
- Spark SQL is faster than Hive when it comes to processing speed.
- Limitations With Hive:
- Hive launches multiple MapReduce jobs to execute a single query. MapReduce jobs are slow and lag in performance.
- Hive has no resume capability. If a job fails, it has to be restarted from the beginning.
- Hive cannot drop encrypted databases in cascade mode when the trash is enabled; the purge option, which skips the trash, must be used instead.
- Spark SQL is not a database but a module for structured data processing. It works primarily with DataFrames, which serve as its programming abstraction, and it acts as a distributed SQL query engine.
- Spark SQL makes Apache Spark accessible to more users and improves optimization for existing ones.
- Spark SQL provides DataFrame APIs that perform relational operations on both external data sources and Spark’s built-in distributed collections.
- It introduces an extensible optimizer called Catalyst, which helps support a wide range of data sources and algorithms for big data.
- Spark SQL Libraries:
- Data Source API: It is a pluggable API that allows Spark to access structured data through Spark SQL.
- Dataframe API: It is a distributed collection of data organized into named columns.
- SQL Interpreter and Optimizer: the component that parses, plans, and optimizes SQL queries (Catalyst is the optimizer).
- SQL Service: the entry point for working with structured data in Spark; it is used to create DataFrames and execute SQL queries.
- Features Of Spark SQL:
- Integration with Spark.
- Uniform Data Access: a single interface to query a variety of sources, such as Hive, Avro, Parquet, ORC, JSON, and JDBC.
- Compatibility with Hive.
- Standard Connectivity: JDBC/ODBC.
- Performance and Scalability.
- User-Defined Functions (UDFs).
- RDDs:
- Resilient Distributed Datasets (RDDs) are distributed memory abstractions that let programmers perform in-memory computations on large clusters in a fault-tolerant manner.
- RDDs can be created from almost any data source, e.g., a Scala collection, the local file system, Hadoop HDFS, Amazon S3, an HBase table, etc.
- RDDs support two types of operations (illustrated in the sketch after this list):
- Transformations:
- Transformations are operations that are applied to RDDs to create a new RDD.
- Examples: map, filter, join, union, reduceByKey, groupByKey, etc.
- Transformations are lazy, meaning they do not compute their results right away; they only compute their results when an action is called.
- Actions:
- Actions are operations that are applied on RDDs to instruct Spark to perform computation and send the result back to the driver.
- Results are usually a value (or a file, but not another RDD).
- Examples: count, collect, reduce, and saveAsTextFile, etc.
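A minimal PySpark sketch of the lazy-transformation / action distinction described above (a running SparkSession named spark is assumed; the sample data is illustrative):

```python
# Transformations build a lineage of new RDDs but trigger no computation yet.
rdd = spark.sparkContext.parallelize(["spark", "hive", "spark", "storm"])
pairs = rdd.map(lambda word: (word, 1))          # transformation (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)   # transformation (lazy)

# Actions force the computation and return results to the driver.
print(counts.collect())   # e.g. [('spark', 2), ('hive', 1), ('storm', 1)] -- order may vary
print(counts.count())     # 3
```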
The Real-Time Big Data Processing Framework: Advantages and Limitations 5¶
- Processing frameworks perform computing over the data in the system, either by reading from non-volatile storage or as it is ingested into the system.
- Computing over data is the process of extracting information and insight from large quantities of individual data points.
- Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine.
- Engines and frameworks can often be swapped out or used in tandem.
- Apache Spark is another processing framework that can be hooked into Hadoop, replacing the MapReduce engine.
- This interoperability between components is one reason that big data systems have great flexibility.
Batch Processing Systems¶
- Batch processing:
- It involves operating over a large, static dataset and returning the result at a later time when the computation is complete.
- It defines processes that run on the entire dataset.
- It has been the traditional way of processing data.
- It is well-suited for calculations where access to a complete set of records is required.
- Example: when calculating totals and averages, the dataset must be treated holistically instead of as a collection of individual records.
- It is not well-suited to workloads where processing time is a significant concern.
- It is usually the choice for mining historical data for insights.
- Datasets ideal for batch processing:
- Bounded: finite in size.
- Persistent: stored in a non-volatile storage system.
- Large: too large to fit in memory.
- Apache Hadoop:
- It is the default batch processing system.
- Batch Processing Model:
- MapReduce’s processing technique follows the map, shuffle, reduce algorithm using key-value pairs (a word-count sketch follows this list).
- Procedure:
- Read the dataset from HDFS.
- Divide the dataset into smaller chunks and distribute them across the cluster nodes.
- Apply the computation to each chunk on its node (the map function); intermediate results are written back to disk.
- Shuffle and sort the intermediate results.
- Reduce the intermediate results to a final result by combining them according to the key.
- Write the final result back to HDFS.
- Advantages:
- Relying on disk storage allows for the processing of datasets that are too large to fit in memory.
- Disk is usually cheaper than memory, which makes it more cost-effective.
- Scalability: adding more nodes to the cluster increases the processing power.
- Strong community support and mature tooling, since it has been around for a long time.
- Disadvantages:
- Using the disk increases the processing time.
- MapReduce concepts are hard to learn and use.
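A single-machine Python sketch of the map → shuffle → reduce flow described above, using word count as the computation (real MapReduce distributes these phases across HDFS and the cluster nodes; the input documents are illustrative):

```python
from collections import defaultdict

documents = ["big data batch processing", "batch processing with hadoop"]

# Map: emit (key, value) pairs from each input chunk.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group all intermediate values by key
# (done across nodes in a real MapReduce job).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine the values for each key into the final result.
reduced = {key: sum(values) for key, values in sorted(groups.items())}
print(reduced)   # {'batch': 2, 'big': 1, 'data': 1, 'hadoop': 1, 'processing': 2, 'with': 1}
```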
Stream Processing Systems¶
- Stream processing:
- It does the computation on each record as it enters the system.
- It defines processes that run on individual records.
- Its processing is event-based and does not “end” until explicitly stopped.
- Results are immediately available and are continually updated as new records arrive.
- It is well-suited for functional processing where the computation is applied to each record individually and is almost always stateless.
- It is well-suited for near real-time processing.
- Analytics, server or application error logging, and other time-based metrics are a natural fit because reacting to changes in these areas can be critical to business functions.
- It is a good fit for data where you must respond to changes or spikes and where you’re interested in trends over time.
- Datasets ideal for stream processing:
- Unbounded: infinite:
- The total dataset is all of the records that have entered the system so far.
- The working dataset is limited to the records being processed at any one time (contrasted in the sketch after this list):
- 1 record at a time: true stream processing.
- A few records at a time: micro-batch processing.
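A framework-free Python sketch contrasting true per-record stream processing with micro-batching over the same unbounded source (the event generator and batch size are illustrative):

```python
import itertools

def events():
    """An unbounded source: yields one event at a time, forever."""
    for i in itertools.count():
        yield {"id": i, "value": i % 5}

# True stream processing: handle each record individually as it arrives.
for event in itertools.islice(events(), 5):
    print("processed record", event["id"])

# Micro-batch processing: buffer a few records, then process them together.
def micro_batches(source, batch_size=3):
    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return
        yield batch

for batch in itertools.islice(micro_batches(events()), 2):
    print("batch total:", sum(event["value"] for event in batch))
```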
Apache Storm¶
- It is a stream processing framework that focuses on extremely low latency.
- It is the best option for workloads that require near real-time processing.
- It can handle very large quantities of data and deliver results with less latency than other solutions.
- Stream processing in Storm works by orchestrating DAGs (Directed Acyclic Graphs), which it calls topologies.
- Processing model (a conceptual sketch follows this list):
- Streams: the data that flows through the topology; unbounded data that are continuously arriving at the system.
- Spouts: the source of the data; they read data from an external source and emit it into the topology.
- Bolts: the processing units; receive data from spouts or other bolts, process it, and emit it to other bolts or sinks.
- By default, Storm offers an at-least-once processing guarantee, meaning that each message is processed at least once, but there may be duplicates in some failure scenarios.
- Storm does not guarantee that messages will be processed in order.
- In order to achieve exactly-once, stateful processing, an abstraction called Trident is also available.
- To be explicit, Apache Storm without Trident is often referred to as Core Storm.
- Trident significantly alters the processing dynamics of Storm, increasing latency, adding a state to the processing, and implementing a micro-batching model instead of an item-by-item pure streaming system.
- Storm Advantages:
- The best solution for near real-time processing.
- Extremely low latency.
- It is useful when processing time is important, such as when the result is sent to the user immediately.
- Flexibility: Storm with Trident gives you the option to use micro-batches instead of pure stream processing.
- Storm is compatible with Hadoop, which allows you to use both systems together.
- It has wide language support.
- Storm Disadvantages:
- It cannot guarantee that messages will be processed in order.
- At-least-once processing can lead to duplicates.
- Exactly-once processing is available (via Trident) but at the cost of increased latency.
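Storm itself is programmed against a JVM API; the following is only a conceptual Python sketch of the stream → spout → bolt topology described above (all names are illustrative and not part of Storm’s API):

```python
def sentence_spout():
    """Spout: reads from an external source and emits tuples into the topology."""
    for line in ["storm processes streams", "streams of tuples flow through bolts"]:
        yield line

def split_bolt(stream):
    """Bolt: receives tuples, processes them, and emits new tuples downstream."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Terminal bolt: keeps a running count per word (a simple sink)."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt.
print(count_bolt(split_bolt(sentence_spout())))
```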
Apache Samza¶
- It is a stream processing framework that is tightly tied to the Apache Kafka messaging system.
- It integrates well with Kafka, thus taking advantage of its unique features and guarantees.
- It uses Kafka to provide fault tolerance, buffering, and state storage.
- Samza relies on YARN for resource negotiation; it can therefore run on any Hadoop cluster with YARN, but this also means a Hadoop cluster is required to run Samza.
- Because Kafka is represented by an immutable log, Samza deals with immutable streams.
- Processing model (a Kafka producer/consumer sketch follows this list):
- Topics: the streams of data entering a Kafka system, where consumers can subscribe to them.
- Partitions:
- The way Kafka distributes data across the cluster.
- Each topic is divided into partitions.
- It is guaranteed that messages with the same key will always go to the same partition.
- Partitions have guaranteed order.
- Brokers: the individual nodes in the Kafka cluster.
- Producers: components that write to a Kafka topic; the producer provides the key for partitioning.
- Consumers: components that read from a Kafka topic.
- Samza Advantages:
- Tight integration with Kafka ensures guarantees that are not available in other systems.
- Kafka’s immutable log allows for fault tolerance and state storage.
- Low latency.
- Kafka’s multi-subscriber model allows for multiple consumers to read from the same topic.
- Intermediate results are sent to Kafka and thus can be consumed independently by downstream systems.
- The loose coupling between processing stages, connected through Kafka topics, allows for easy scaling and multiple consumers without stressing the system.
- Samza disadvantages:
- It does not provide accurate recovery of aggregated state (like counts) in the event of a failure since data might be delivered more than once.
- It only supports JVM languages (Java, Scala).
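A hedged sketch of the Kafka concepts above using the third-party kafka-python package (assumes a broker at localhost:9092; the topic name, key, and payloads are illustrative):

```python
from kafka import KafkaConsumer, KafkaProducer  # third-party: kafka-python

# Producer: the key determines the partition, so records sharing a key
# (e.g. the same user) land on the same partition and keep their order.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("page-views", key=b"user-42", value=f"click-{i}".encode())
producer.flush()

# Consumer: subscribes to the topic and reads records in partition order.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.partition, record.key, record.value)
```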
Hybrid Processing Systems¶
- Some processing frameworks can handle both batch and stream workloads.
Apache Spark¶
- It is a next-generation batch-processing framework with stream-processing capabilities.
- Built using many of the same principles of Hadoop’s MapReduce engine, Spark focuses primarily on speeding up batch processing workloads by offering full in-memory computation and processing optimization.
- Spark can be deployed as a standalone cluster (if paired with a capable storage layer) or can hook into Hadoop as an alternative to the MapReduce engine.
- Batch Processing Model:
- Spark processes all data in memory, only interacting with the storage layer to initially load the data into memory and at the end to persist the final results.
- All intermediate results are managed in memory.
- Spark is also faster on disk-related tasks because of holistic optimization that can be achieved by analyzing the complete set of tasks ahead of time.
- Spark uses a model called Resilient Distributed Datasets, or RDDs, to work with data. These are immutable structures that exist within memory that represent collections of data.
- Stream Processing Model:
- Stream processing capabilities are supplied by Spark Streaming.
- Spark itself is designed with batch-oriented workloads in mind.
- It uses a micro-batching model to simulate stream processing (see the sketch after this list).
- Spark Streaming works by buffering the stream in sub-second increments.
- These are sent as small fixed datasets for batch processing.
- Spark Advantages:
- It is significantly faster than Hadoop MapReduce, especially for in-memory workloads.
- It can be deployed as a standalone cluster or integrated with an existing Hadoop cluster.
- It can handle both batch and stream processing.
- It has an ecosystem of libraries that can be used for machine learning, interactive queries, etc.
- Easy to use and learn compared to MapReduce.
- Spark Disadvantages:
- Buffering can increase latency.
- RAM is more expensive than disk storage, thus increasing costs.
- It uses more resources than Hadoop.
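A minimal PySpark Streaming sketch of the micro-batching model described above (assumes a text source on localhost:9999, e.g. started with `nc -lk 9999`; the legacy DStream API is used here because it exposes the batch interval directly):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "micro-batch-sketch")
ssc = StreamingContext(sc, batchDuration=1)  # buffer the stream into 1-second micro-batches

# Each micro-batch of lines is handed to Spark as a small, fixed RDD.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```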
Apache Flink¶
- It is a stream processing framework that can also handle batch tasks.
- It considers batches to simply be data streams with finite boundaries and thus treats batch processing as a subset of stream processing.
- This stream-first approach to all processing has a number of interesting side effects.
- This stream-first approach has been called the Kappa architecture, in contrast to the more widely known Lambda architecture.
- Stream processing model:
- Flink’s stream processing model handles incoming data on an item-by-item basis as a true stream.
- Streams: immutable, unbounded datasets that flow through the system.
- Operators: the processing units that receive data from streams, process it, and emit it to other operators or sinks.
- Sources: the entry points for data into the system.
- Sinks: the exit points for data out of the system.
- Stream processing tasks take snapshots at set points during their computation to use for recovery in case of problems.
- Flink’s stream processing understands the concept of event time, meaning the time that the event occurred, and can handle sessions.
- This means that it can guarantee ordering and grouping in some interesting ways.
- Batch processing model:
- Flink treats batch processing as a subset of stream processing.
- It uses the same operators and concepts as stream processing.
- It treats a batch as a bounded stream (see the sketch after this list).
- It can handle both batch and stream processing in the same system.
- Flink Advantages:
- Flink’s stream-first approach offers low latency, high throughput, and true entry-by-entry processing.
- Its true record-at-a-time stream processing makes it fairly unique among processing frameworks.
- It has its own garbage collection system that allows for better memory management.
- It can automatically perform some optimizations that must be done manually in Spark.
- It handles data partitioning and caching automatically.
- Flink offers a web-based scheduling view to easily manage tasks and view the system.
- Flink Disadvantages:
- It is still a very young project.
- Large-scale deployments in the wild are still not as common as other processing frameworks.
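Flink’s own APIs are JVM-first (with PyFlink as a Python layer); the framework-free Python sketch below only illustrates the “batch is a bounded stream” idea above by running the same operator pipeline over a bounded and an unbounded source (all names are illustrative):

```python
import itertools

def pipeline(stream):
    """The same operators apply to any stream, bounded or unbounded."""
    squared = (x * x for x in stream)            # operator 1
    evens = (x for x in squared if x % 2 == 0)   # operator 2
    return evens

# Batch processing: a bounded stream (finite source), consumed to completion.
print(list(pipeline(range(10))))   # [0, 4, 16, 36, 64]

# Stream processing: an unbounded source, consumed record by record as data arrives.
for value in itertools.islice(pipeline(itertools.count()), 5):
    print(value)                   # 0, 4, 16, 36, 64
```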
High-Efficient Fuzzy Querying with HiveQL for Big Data Warehousing 6¶
- Apache Hive is a data warehousing framework working on top of the Hadoop platform for big data processing.
- To overcome the challenges of big data, changes have to be made that involve moving:
- From fixed-schema data warehouses to schema-less data lakes for storing data.
- From extract, transform, and load to extract, load, and transform procedures for data capturing and preparation.
- From schema on write to schema on read data analysis strategies.
- From scaling up to scaling out solutions for increasing the performance of analytic queries.
- Hive queries and analyzes big data stored in HDFS using an SQL-like query language called HiveQL.
- FuzzyHive:
- It is a library that resides on top of Hive and brings the benefits of fuzzy logic to data processing, querying, and analysis.
- It supports elastic grouping and filtering on dimensional attributes.
- It supports fuzzy joining on attributes shared by many tables of a big data warehouse.
- It supports flexible ad hoc querying capabilities against the dynamically defined schema of a data warehouse built over a big data lake.
Video Resources 7 8 9¶
Apache Hive¶
- Hive allows you to write SQL-like queries that are converted into jobs for an execution engine (MapReduce by default, or Spark, Tez, etc.) and executed on the Hadoop cluster (see the connection sketch after this list).
- Hive is not a database but a data warehousing framework; it cannot be used for online transaction processing (OLTP) or real-time processing.
- Hive supports many file formats, including text, sequence, ORC, and Parquet.
- Hive stores metadata in a relational database, such as MySQL, PostgreSQL, or Derby.
- Hive provides good compression and indexing capabilities.
- Supports user-defined functions (UDFs) and user-defined aggregates (UDAFs).
- Supports specialized joins.
- Hive vs RDBMS:
- Hive enforces schema on read, while an RDBMS enforces schema on write.
- Hive scales to hundreds of petabytes of storage; an RDBMS typically scales to tens of terabytes.
- Hive does not support OLTP; an RDBMS does.
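A hedged sketch of running a HiveQL query from Python using the third-party PyHive package (assumes a HiveServer2 instance on localhost:10000 and reuses the illustrative us_delay_flights_tbl table name from earlier):

```python
from pyhive import hive  # third-party: PyHive

# Connect to HiveServer2; Hive compiles the query into jobs for the
# configured execution engine (MapReduce, Tez, or Spark) on the cluster.
conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT origin, COUNT(*) AS flights "
               "FROM us_delay_flights_tbl GROUP BY origin")
for origin, flights in cursor.fetchall():
    print(origin, flights)
cursor.close()
conn.close()
```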
References¶
1. Apache.org (2022). Built-in SQL Commands. Retrieved August 25, 2022, from https://spark.apache.org/docs/latest/api/sql/index.html ↩
2. Quick Start. (2022). Apache Spark 3.3.0. https://spark.apache.org/docs/latest/quick-start.html ↩
3. Damji, J. S., Wenig, B., Das, T., & Lee, D. (2020). Learning Spark (2nd ed.). O’Reilly Media, Inc. https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch04.html ↩
4. Dayananda, S. (2022, March 28). Spark SQL tutorial – Understanding Spark SQL with examples. Edureka. https://www.edureka.co/blog/spark-sql-tutorial/ ↩
5. Gurusamy, V., Kannan, S., & Nandhini, K. (2017). The real-time big data processing framework: Advantages and limitations. International Journal of Computer Sciences and Engineering, 5(12), 305-312. https://www.researchgate.net/profile/Vairaprakash-Gurusamy/publication/322550872_The_Real_Time_Big_Data_Processing_Framework_Advantages_and_Limitations/links/5a5f4658458515c03ee1d07c/The-Real-Time-Big-Data-Processing-Framework-Advantages-and-Limitations.pdf ↩
6. Małysiak-Mrozek, B., Wieszok, J., Pedrycz, W., Ding, W., & Mrozek, D. (2022). High-efficient fuzzy querying with HiveQL for big data warehousing. IEEE Transactions on Fuzzy Systems, 30(6), 1823-1837. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9388934 ↩
7. BigDataELearning. (2017, December 17). What is Apache Hive?: Understanding Hive [Video]. YouTube. https://www.youtube.com/watch?v=cMziv1iYt28 ↩
8. Simplilearn. (2017, November 2). Big Data Tools and Technologies | Big Data Tools Tutorial | Big Data Training | Simplilearn [Video]. YouTube. https://youtu.be/Pyo4RWtxsQM?t=103 ↩
9. Computerphile. (2018, December 12). Apache Spark - Computerphile [Video]. YouTube. https://www.youtube.com/watch?v=tDVPcqGpEnM ↩