
4. Querying Big Data

Introduction 1

Apache Spark: Quick Start 2

  • Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD).
  • After Spark 2.0, RDDs are replaced by Dataset, which is strongly typed like an RDD, but with richer optimizations under the hood.
  • Dataset:
    • A Dataset is a distributed collection of data.
    • Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
    • Dataset API is available in Scala and Java only.
    • Python does not have a dedicated Dataset API, but because of Python’s dynamic nature, many of the Dataset API’s benefits are already available.
  • Dataframe:
    • A Dataframe is a Dataset organized into named columns.
    • It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
    • Dataframes can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs (a short sketch follows this list).
    • Dataframes are available in 4 languages: Scala, Java, Python, and R.
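
  • A minimal PySpark sketch of constructing DataFrames from a few of these sources (the file path, sample data, and commented-out table name are illustrative, not taken from the original sources):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sources").getOrCreate()

# From a structured data file (path is hypothetical)
json_df = spark.read.json("/tmp/people.json")

# From an existing RDD of tuples, with named columns
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 28)])
rdd_df = spark.createDataFrame(rdd, ["name", "age"])

# From a table already registered in the metastore
# table_df = spark.table("us_delay_flights_tbl")

rdd_df.show()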

Spark SQL and DataFrames: Introduction to Built-in Data Sources 3

  • All SQL queries can be expressed with an equivalent DataFrame API query. For example, the first query can be expressed in the Python DataFrame API as follows (where df is a DataFrame holding the same US flight-delay data):
# SQL query
spark.sql("""SELECT distance, origin, destination
    FROM us_delay_flights_tbl WHERE distance > 1000
    ORDER BY distance DESC""").show(10)

# DataFrame API query
from pyspark.sql.functions import col, desc
(df.select("distance", "origin", "destination")
  .where(col("distance") > 1000)
  .orderBy(desc("distance"))).show(10)
  • Instead of having a separate metastore for Spark tables, Spark by default uses the Apache Hive metastore, located at /user/hive/warehouse, to persist all the metadata about your tables.
  • Spark allows you to create two types of tables, managed and unmanaged (a hedged example follows this list):
    • Managed tables:
      • Spark manages the lifecycle of the table, including the data and metadata.
      • Data can be stored in a local file system, HDFS, or an object store such as Amazon S3 or Azure Blob Storage.
      • Spark manages both the metadata and the data, so operations such as dropping a table delete the data files as well as the metadata.
    • Unmanaged tables:
      • Spark manages the metadata, but the data is stored on an external data source.
      • External data sources can include Cassandra.
      • Spark manages only the metadata; it does not own the data, so it can read it but cannot delete or modify the underlying files.
  • Tables reside within a database. By default, Spark creates tables under the default database.
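
  • A hedged sketch of creating both kinds of tables with Spark SQL (the database name, schema, and file path are illustrative):

# Create and switch to a database; the tables below are created inside it
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")

# Managed table: Spark owns both the metadata and the data files
spark.sql("""CREATE TABLE managed_us_delay_flights_tbl
    (date STRING, delay INT, distance INT, origin STRING, destination STRING)""")

# Unmanaged table: Spark tracks only the metadata; the data stays at the
# external path and is not deleted when the table is dropped
spark.sql("""CREATE TABLE us_delay_flights_tbl
    (date STRING, delay INT, distance INT, origin STRING, destination STRING)
    USING csv OPTIONS (PATH '/tmp/data/departuredelays.csv')""")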

Spark SQL Tutorial 4

  • Spark supports querying data either via SQL or via the Hive Query Language.
  • Spark SQL integrates relational processing with Spark’s functional programming.
  • Spark SQL was built to overcome these drawbacks (see the Hive limitations below) and to replace Apache Hive.
  • Spark SQL is faster than Hive when it comes to processing speed.
  • Limitations With Hive:
    • Hive launches multiple MapReduce jobs to execute a single query. MapReduce jobs are slow and lag in performance.
    • Hive has no resume capability. If a job fails, it has to be restarted from the beginning.
    • Hive cannot drop encrypted databases in cascade mode when the trash is enabled. The purge option must be used which skips the trash.
  • Spark SQL is not a database but a module for structured data processing. It works primarily on DataFrames, which are the programming abstraction, and it usually acts as a distributed SQL query engine.
  • With Spark SQL, Apache Spark is accessible to more users and improves optimization for the current ones.
  • Spark SQL provides DataFrame APIs that perform relational operations on both external data sources and Spark’s built-in distributed collections.
  • It introduces an extensible optimizer called Catalyst, which helps support a wide range of data sources and algorithms used in big data.
  • Spark SQL Libraries:
    • Data Source API: It is a pluggable API that allows Spark to access structured data through Spark SQL.
    • Dataframe API: It is a distributed collection of data organized into named columns.
    • SQL Interpreter and Optimizer: the component that parses, plans, and optimizes queries over structured data.
    • SQL Service: the entry point for working with structured data; it allows the creation of DataFrames and the execution of SQL queries.
  • Features Of Spark SQL:
    • Integration with Spark.
    • Uniform Data Access: query a variety of sources, such as Hive, Avro, Parquet, ORC, JSON, and JDBC, in the same way.
    • Compatibility with Hive.
    • Standard Connectivity: JDBC/ODBC.
    • Performance and Scalability.
    • User-Defined Functions (UDFs).
  • RDDs:
    • Resilient Distributed Datasets (RDDs) are distributed memory abstractions that let programmers perform in-memory computations on large clusters in a fault-tolerant manner.
    • RDDs can be created from any data source. Eg: Scala collection, local file system, Hadoop, Amazon S3, HBase Table, etc.
  • RDDs support two types of operations (a short PySpark sketch follows this list):
    • Transformations:
      • Transformations are operations that are applied to RDDs to create a new RDD.
      • Examples: map, filter, join, union, reduceByKey, and groupByKey, etc.
      • Transformations are lazy, meaning they do not compute their results right away; they only compute their results when an action is called.
    • Actions:
      • Actions are operations that are applied on RDDs to instruct Spark to perform computation and send the result back to the driver.
      • Results are usually a value (or a file, but not another RDD).
      • Examples: count, collect, reduce, and saveAsTextFile, etc.
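
  • A minimal PySpark sketch of lazy transformations followed by actions (the SparkContext setup and sample data are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

# Transformations: lazily describe new RDDs; nothing is computed yet
nums = sc.parallelize(range(1, 11))
evens = nums.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Actions: trigger the computation and return results to the driver
print(squares.collect())                   # [4, 16, 36, 64, 100]
print(squares.count())                     # 5
print(squares.reduce(lambda a, b: a + b))  # 220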

The Real-Time Big Data Processing Framework: Advantages and Limitations 5

  • Processing frameworks perform computing over the data in the system, either by reading from non-volatile storage or as it is ingested into the system.
  • Computing over data is the process of extracting information and insight from large quantities of individual data points.
  • Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine.
  • Engines and frameworks can often be swapped out or used in tandem.
  • Apache Spark is another processing framework that can be hooked up to Hadoop replacing the MapReduce engine.
  • This interoperability between components is one reason that big data systems have great flexibility.

Batch Processing Systems

  • Batch processing:
    • It involves operating over a large, static dataset and returning the result at a later time when the computation is complete.
    • It defines processes that run on the entire dataset.
    • It has been the traditional way of processing data.
    • It is well-suited for calculations where access to a complete set of records is required.
    • Example: when calculating totals and averages, the dataset must be treated holistically instead of as a collection of individual records.
    • It is not well-suited for workloads where processing time is a significant concern.
    • It is usually the choice for mining historical data for insights.
  • Datasets ideal for batch processing:
    • Bounded: finite in size.
    • Persistent: stored in a non-volatile storage system.
    • Large: too large to fit in memory.
  • Apache Hadoop:
    • It is the default batch processing system.
    • Batch Processing Model:
      • MapReduce’s processing technique follows the map, shuffle, reduce algorithm using key-value pairs (a toy single-machine sketch appears at the end of this section).
      • Procedure:
        • Read the dataset from HDFS.
        • Divide the dataset into smaller chunks and distribute them across the cluster nodes.
        • Apply the computation to each chunk through a node (map function); intermediate results are stored back to HDFS.
        • Shuffle and sort the intermediate results.
        • Reduce the intermediate results to a final result by combining them according to the key.
        • Write the final result back to HDFS.
    • Advantages:
      • Relying on disk storage allows for the processing of datasets that are too large to fit in memory.
      • Disk is usually cheaper than memory, which makes it more cost-effective.
      • Scalability: adding more nodes to the cluster increases the processing power.
      • Community support, especially as it has been around for a long time.
    • Disadvantages:
      • Using the disk increases the processing time.
      • MapReduce concepts are hard to learn and use.
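
  • To make the map, shuffle, and reduce phases concrete, here is a toy, single-process Python word count; real MapReduce distributes these same steps across HDFS and the cluster nodes (the input documents are made up):

from itertools import groupby
from operator import itemgetter

documents = ["big data needs batch processing",
             "batch processing reads big data"]

# Map: emit (key, value) pairs for every record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group the intermediate pairs by key
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: combine the values for each key into the final result
counts = {word: sum(count for _, count in pairs) for word, pairs in grouped}
print(counts)  # {'batch': 2, 'big': 2, 'data': 2, 'needs': 1, ...}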

Stream Processing Systems

  • Stream processing:
    • It does the computation on each record as it enters the system.
    • It defines processes that run on individual records.
    • Its processing is event-based; it never ends until explicitly stopped.
    • Its results are immediately available and are continually updated as new records arrive.
    • It is well-suited for functional processing where the computation is applied to each record individually and is almost always stateless.
    • It is well-suited for near real-time processing.
    • Analytics, server or application error logging, and other time-based metrics are a natural fit because reacting to changes in these areas can be critical to business functions.
    • It is a good fit for data where you must respond to changes or spikes and where you’re interested in trends over time.
  • Datasets ideal for stream processing:
    • Unbounded: the dataset is effectively infinite:
      • The total dataset is all the records that have entered the system so far.
      • The working dataset is the set of records considered in a single computation (see the toy sketch below):
        • 1 record at a time: true stream processing.
        • A small buffer of records at a time: micro-batch processing.
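
  • A toy, single-process Python sketch of the difference between true stream processing (working set of one record) and micro-batch processing (working set of a few records); the data source is made up:

import itertools

def event_stream():
    """Hypothetical unbounded source of numeric readings."""
    n = 0
    while True:
        n += 1
        yield n

# True stream processing: the result is updated after every single record
running_total = 0
for value in itertools.islice(event_stream(), 10):
    running_total += value
print("streaming total so far:", running_total)  # 55

# Micro-batch processing: a small buffer of records is processed together
batch, batch_sums = [], []
for value in itertools.islice(event_stream(), 10):
    batch.append(value)
    if len(batch) == 5:
        batch_sums.append(sum(batch))
        batch = []
print("micro-batch results:", batch_sums)        # [15, 40]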

Apache Storm

  • It is a stream processing framework that focuses on extremely low latency.
  • It is the best option for workloads that require near real-time processing.
  • It can handle very large quantities of data and deliver results with less latency than other solutions.
  • Storm’s stream processing works by orchestrating DAGs (Directed Acyclic Graphs) that it calls topologies.
  • Processing model:
    • Streams: the data that flows through the topology; unbounded data that are continuously arriving at the system.
    • Spouts: the source of the data; they read data from an external source and emit it into the topology.
    • Bolts: the processing units; receive data from spouts or other bolts, process it, and emit it to other bolts or sinks.
  • By default, Storm offers an at-least-once processing guarantee: each message is guaranteed to be processed at least once, but duplicates may occur in some failure scenarios.
  • Storm does not guarantee that messages will be processed in order.
  • In order to achieve exactly-once, stateful processing, an abstraction called Trident is also available.
  • To be explicit, Apache Storm without Trident is often referred to as Core Storm.
  • Trident significantly alters the processing dynamics of Storm, increasing latency, adding a state to the processing, and implementing a micro-batching model instead of an item-by-item pure streaming system.
  • Storm Advantages:
    • The best solution for near real-time processing.
    • Extremely low latency.
    • It is useful when processing time is important, such as when the result is sent to the user immediately.
    • Flexibility: Storm with Trident gives you the option to use micro-batches instead of pure stream processing.
    • Storm is compatible with Hadoop, which allows you to use both systems together.
    • It has wide language support.
  • Storm Disadvantages:
    • It cannot guarantee that messages will be processed in order.
    • At-least-once processing can lead to duplicates.
    • Exactly-once processing is available (via Trident), but at the cost of increased latency.

Apache Samza

  • It is a stream processing framework that is tightly tied to the Apache Kafka messaging system.
  • It integrates well with Kafka, thus taking advantage of its unique features and guarantees.
  • It uses Kafka to provide fault tolerance, buffering, and state storage.
  • Samza uses YARN for resource negotiation, so it runs on top of a Hadoop cluster (YARN plus HDFS); this also means Hadoop is required to run Samza.
  • Because Kafka is represented by an immutable log, Samza deals with immutable streams.
  • Processing model:
    • Topics: the streams of data entering a Kafka system, where consumers can subscribe to them.
    • Partitions:
      • The way Kafka distributes data across the cluster.
      • Each topic is divided into partitions.
      • It is guaranteed that messages with the same key will always go to the same partition.
      • Partitions have guaranteed order.
    • Brokers: the individual nodes in the Kafka cluster.
    • Producers: components that write to a Kafka topic; the producer provides the key used for partitioning (see the sketch after this list).
    • Consumers: components that read from a Kafka topic.
  • Samza Advantages:
    • Tight integration with Kafka ensures guarantees that are not available in other systems.
    • Kafka’s immutable log allows for fault tolerance and state storage.
    • Low latency.
    • Kafka’s multi-subscriber model allows for multiple consumers to read from the same topic.
    • Intermediate results are sent to Kafka and thus can be consumed independently by downstream systems.
    • This loose coupling between processing stages (mediated by Kafka topics) allows for easy scaling and multiple consumers without stressing the system.
  • Samza disadvantages:
    • It does not provide accurate recovery of aggregated state (like counts) in the event of a failure since data might be delivered more than once.
    • It only supports JVM languages (Java, Scala).
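
  • A hedged sketch of keyed writes and reads against Kafka using the kafka-python client (the client library, topic name, and broker address are assumptions for illustration; they are not part of the original text):

from kafka import KafkaProducer, KafkaConsumer

# Producer: the message key determines the partition, so every event with
# key "user-1" lands in the same partition and keeps its relative order
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("clicks", key=b"user-1", value=f"click-{i}".encode())
producer.flush()

# Consumer: subscribes to the topic and reads records from its partitions
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.partition, record.key, record.value)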

Hybrid Processing Systems

  • Some processing frameworks can handle both batch and stream workloads.

Apache Spark

  • It is a next-generation batch-processing framework with stream-processing capabilities.
  • Built using many of the same principles of Hadoop’s MapReduce engine, Spark focuses primarily on speeding up batch processing workloads by offering full in-memory computation and processing optimization.
  • Spark can be deployed as a standalone cluster (if paired with a capable storage layer) or can hook into Hadoop as an alternative to the MapReduce engine.
  • Batch Processing Model:
    • Spark processes all data in memory, only interacting with the storage layer to initially load the data into memory and at the end to persist the final results.
    • All intermediate results are managed in memory.
    • Spark is also faster on disk-related tasks because of holistic optimization that can be achieved by analyzing the complete set of tasks ahead of time.
    • Spark uses a model called Resilient Distributed Datasets, or RDDs, to work with data. These are immutable structures that exist within memory that represent collections of data.
  • Stream Processing Model:
    • Stream processing capabilities are supplied by Spark Streaming.
    • Spark itself is designed with batch-oriented workloads in mind.
    • It uses a micro-batching model to simulate stream processing (a short Spark Streaming sketch follows at the end of this section).
    • Spark Streaming works by buffering the stream in sub-second increments.
    • These are sent as small fixed datasets for batch processing.
  • Spark Advantages:
    • It is faster than Hadoop.
    • It can be deployed as a standalone cluster or integrated with an existing Hadoop cluster.
    • It can handle both batch and stream processing.
    • It has an ecosystem of libraries that can be used for machine learning, interactive queries, etc.
    • Easy to use and learn compared to MapReduce.
  • Spark Disadvantages:
    • Buffering can increase latency.
    • RAM is more expensive than disk storage, thus increasing costs.
    • It uses more resources than Hadoop.
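
  • A minimal Spark Streaming (DStream API) sketch of the micro-batching described above; the host, port, and one-second batch interval are illustrative:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "micro-batch-example")
ssc = StreamingContext(sc, 1)  # buffer the stream into 1-second micro-batches

# Each micro-batch of lines is handed to Spark as a small, fixed dataset
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # prints the word counts computed for each micro-batch

ssc.start()
ssc.awaitTermination()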

Apache Flink

  • It is a stream processing framework that can also handle batch tasks.
  • It considers batches to simply be data streams with finite boundaries and thus treats batch processing as a subset of stream processing.
  • This stream-first approach to all processing has a number of interesting side effects.
  • This stream-first approach has been called the Kappa architecture, in contrast to the more widely known Lambda architecture.
  • Stream processing model:
    • Flink’s stream processing model handles incoming data on an item-by-item basis as a true stream.
    • Streams: immutable, unbounded datasets that flow through the system.
    • Operators: the processing units that receive data from streams, process it, and emit it to other operators or sinks.
    • Sources: the entry points for data into the system.
    • Sinks: the exit points for data out of the system.
    • Stream processing tasks take snapshots at set points during their computation to use for recovery in case of problems.
    • Flink’s stream processing understands the concept of event time, meaning the time that the event occurred, and can handle sessions.
    • This means that it can guarantee ordering and grouping in some interesting ways.
  • Batch processing model:
    • Flink treats batch processing as a subset of stream processing.
    • It uses the same operators and concepts as stream processing.
    • It treats a batch as a bounded stream.
    • It can handle both batch and stream processing in the same system.
  • Flink Advantages:
    • Flink’s stream-first approach offers low latency, high throughput, and real entry-by-entry processing.
    • It is unique since it allows true stream processing.
    • It has its own garbage collection system that allows for better memory management.
    • It can automatically perform some optimizations that must be done manually in Spark.
    • It handles data partitioning and caching automatically.
    • Flink offers a web-based scheduling view to easily manage tasks and view the system.
  • Flink Disadvantages:
    • It is still a very young project.
    • Large-scale deployments in the wild are still not as common as other processing frameworks.

High-Efficient Fuzzy Querying with HiveQL for Big Data Warehousing 6

  • Apache Hive is a data warehousing framework working on top of the Hadoop platform for big data processing.
  • To overcome the challenges of big data, changes have to be made that involve moving:
    • From fixed-schema data warehouses to schema-less data lakes for storing data.
    • From extract, transform, and load to extract, load, and transform procedures for data capturing and preparation.
    • From schema on write to schema on read data analysis strategies.
    • From scaling up to scaling out solutions for increasing the performance of analytic queries.
  • Hive queries and analyzes big data stored in HDFS using a Structured Query Language (SQL)-like language called HiveQL.
  • FuzzyHive:
    • It is a library that resides on top of Hive and brings the benefits of fuzzy logic to data processing, querying, and analysis.
    • It supports elastic grouping and filtering on dimensional attributes.
    • It supports fuzzy joining on attributes shared by many tables of a big data warehouse.
    • It supports flexible ad hoc querying capabilities against the dynamically defined schema of a data warehouse built over a big data lake.

Video Resources 7 8 9

Apache Hive

  • Hive allows you to write SQL-like queries that are converted into jobs for an execution engine (MapReduce, Tez, or Spark) and executed on the Hadoop cluster (a PySpark-based sketch follows this list).
  • Hive is not a database but a data warehousing framework; it is not suited for online transaction processing (OLTP) or real-time processing.
  • Hive supports many file formats, including text, sequence, ORC, and Parquet.
  • Hive stores metadata in a relational database, such as MySQL, PostgreSQL, or Derby.
  • Hive provides good compression and indexing capabilities.
  • Supports user-defined functions (UDFs) and user-defined aggregates (UDAFs).
  • Supports specialized joins.
  • Hive vs RDBMS:
    • Hive enforces schema on read, while RDBMSs enforce schema on write.
    • Hive scales to hundreds of petabytes of storage; an RDBMS typically handles tens of terabytes.
    • Hive does not support OLTP; RDBMS supports OLTP.
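
  • A hedged sketch of working with a Hive-backed table from PySpark using Spark’s built-in Hive support (the table name, schema, and query are illustrative):

from pyspark.sql import SparkSession

# Hive-enabled session: table metadata is kept in the Hive metastore
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL-style DDL: an ORC-backed table registered in the metastore
spark.sql("""CREATE TABLE IF NOT EXISTS web_logs
             (ip STRING, url STRING, ts TIMESTAMP)
             STORED AS ORC""")

# SQL-like query; the underlying engine turns it into distributed jobs
spark.sql("SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url").show()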

References


  1. Apache Software Foundation. (2022). Built-in SQL commands. Retrieved August 25, 2022, from https://spark.apache.org/docs/latest/api/sql/index.html

  2. Quick Start. (2022). Apache Spark 3.3.0. https://spark.apache.org/docs/latest/quick-start.html

  3. Damji, J. S., Wenig, B., Das, T., & Lee, D. (2020). Learning Spark (2nd ed.). O’Reilly Media, Inc. https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch04.html

  4. Dayananda, S. (2022, March 28). Spark SQL tutorial – Understanding Spark SQL with examples. Edureka. https://www.edureka.co/blog/spark-sql-tutorial/ 

  5. Gurusamy, V., Kannan, S., & Nandhini, K. (2017). The real-time big data processing framework: Advantages and limitations. International Journal of Computer Sciences and Engineering, 5(12), 305-312. https://www.researchgate.net/profile/Vairaprakash-Gurusamy/publication/322550872_The_Real_Time_Big_Data_Processing_Framework_Advantages_and_Limitations/links/5a5f4658458515c03ee1d07c/The-Real-Time-Big-Data-Processing-Framework-Advantages-and-Limitations.pdf

  6. Małysiak-Mrozek, B., Wieszok, J., Pedrycz, W., Ding, W., & Mrozek, D. (2022). High-efficient fuzzy querying with HiveQL for big data warehousing. IEEE Transactions on Fuzzy Systems, 30(6), 1823-1837. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9388934

  7. BigDataELearning. (2017, December 17). What is Apache Hive?: Understanding Hive [Video]. YouTube. https://www.youtube.com/watch?v=cMziv1iYt28

  8. Simplilearn. (2017, November 2). Big Data Tools and Technologies | Big Data Tools Tutorial | Big Data Training | Simplilearn [Video]. YouTube. https://youtu.be/Pyo4RWtxsQM?t=103 

  9. Computerphile. (2018, December 12). Apache Spark - Computerphile [Video]. YouTube. https://www.youtube.com/watch?v=tDVPcqGpEnM