
WA4. Techniques for Querying Big Data

Statement

  • Identify the three different querying techniques used to query big data that can benefit organizations.
  • Discuss how organizations are implementing the three techniques you mentioned for querying big data.

Answer

Introduction

The Vs of big data describe its defining characteristics: volume (the size of the data), velocity (the speed at which the data is generated and processed), variety (the range of data types and sources), veracity (the truthfulness of the data), and value (the business importance of the data).

Each of these characteristics presents challenges to organizations when querying big data. To overcome these challenges, organizations have adopted various techniques: moving from fixed-schema data warehouses to schema-less data lakes for storing data; from extract, transform, and load (ETL) to extract, load, and transform (ELT) procedures for data capture and preparation; from schema-on-write to schema-on-read data analysis strategies; and from scaling up to scaling out to increase the performance of analytic queries (Mrozek et al., 2022).

The default querying technique is to use SQL-like languages such as HiveQL. Mrozek et al. (2022) identified further techniques, including FuzzyHive, a wrapper on top of HiveQL that allows for more elasticity in querying big data. Other techniques include using NoSQL databases like MongoDB and Cassandra; using stream or micro-batch processing instead of batch processing; and using in-memory caching for faster query execution.

Our research did not yield decisive answers about what can be called “querying techniques” for big data, but the text below discusses three techniques: using SQL-like languages like HiveQL, using frameworks’ exposed APIs, and using stream and micro-batch processing.

Technique 1: Using SQL-like Languages like HiveQL

Almost all processing frameworks accept SQL-like syntax for querying data. It is a way of leveraging the widely known SQL language for querying big data without forcing analysts to learn a new language. The way this works differs from traditional DBMSs: depending on the framework, the data is loaded (from disk or memory) into relevant data structures such as DataFrames and Datasets, and an engine converts the queries into a series of operations that the query processor applies over the data to evaluate them (Damji et al., 2020).

The distributed nature of big data frameworks like Hadoop (MapReduce) and Spark (in-memory) allows for the parallel execution of queries on large volumes of data. This is suitable for batch processing, where the use case requires going through an entire finite dataset to extract insights (Gurusamy et al., 2017). An example would be using HiveQL on Hadoop to query data gathered from all hospitals in a country to examine the correlation between certain lifestyle choices and the prevalence of a disease.
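As a minimal sketch of this technique, the following uses Spark SQL with Hive support to run a HiveQL-style query; the hospital_records table and its columns are illustrative assumptions, not taken from the cited sources.

    # Run a HiveQL-style aggregation over a hypothetical "hospital_records" table
    # to relate a lifestyle factor (smoking) to the prevalence of a disease.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hiveql-style-query")
             .enableHiveSupport()   # lets Spark read tables registered in the Hive metastore
             .getOrCreate())

    prevalence = spark.sql("""
        SELECT smoker,
               AVG(CASE WHEN has_disease THEN 1 ELSE 0 END) AS disease_prevalence
        FROM hospital_records
        GROUP BY smoker
    """)
    prevalence.show()

The query itself stays plain SQL; the engine translates it into distributed operations over the underlying data, which is what makes the technique approachable for analysts.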

Technique 2: Using Frameworks’ Exposed APIs

Big data frameworks like Spark, Flink, and Storm provide APIs to read and write data in different formats such as Avro, Parquet, ORC, JSON, and CSV. These APIs allow for more flexibility in querying the data, since the step of transforming the data into a common format can be skipped. This is beneficial for organizations with complex queries for which SQL-like languages are not sufficient or not efficient (Gurusamy et al., 2017).

The framework takes care of the efficient execution of the query in batch-processing mode. This option also suits people who come from a programming background and are more comfortable writing code than writing SQL queries. An example would be using Spark’s DataFrame API to query data from a Cassandra database to find the average number of orders per customer.
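A minimal sketch of that example follows, assuming the Spark–Cassandra connector is available on the classpath; the keyspace, table, and column names are hypothetical.

    # Read a hypothetical "orders" table from Cassandra through the DataFrame API
    # and compute the average number of orders per customer.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-api-query").getOrCreate()

    orders = (spark.read
              .format("org.apache.spark.sql.cassandra")   # Spark-Cassandra connector data source
              .options(keyspace="shop", table="orders")   # assumed keyspace and table names
              .load())

    avg_orders = (orders
                  .groupBy("customer_id").count()         # number of orders per customer
                  .agg(F.avg("count").alias("avg_orders_per_customer")))
    avg_orders.show()

Note how the query is expressed as chained method calls rather than SQL text, which makes it easier to compose, test, and reuse from application code.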

Technique 3: Using Stream and Micro-batch Processing

Not all use cases require passing through an entire dataset, nor are all datasets finite. Stream or micro-batch processing frameworks like Apache Storm, Samza, and Flink are well suited when processing time is a concern, such as when immediate feedback must be returned to the user in near real time (Gurusamy et al., 2017). These frameworks process data as it arrives, either record by record or in small batches, and execute the query on that data.

The intermediate results are then stored in memory or on disk, or passed to another system for further processing. This is suitable for use cases like fraud detection, real-time monitoring, and recommendation systems. An example would be using Flink to process data from a Kafka topic and return the top 10 most viewed products in the last 5 minutes.
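The example above names Flink; to keep these sketches in one language, the following uses Spark Structured Streaming, which likewise processes streams as micro-batches. The Kafka topic, broker address, and message layout are assumptions for illustration, and the Kafka connector package is assumed to be installed.

    # Count product views per 5-minute window from a hypothetical Kafka topic and
    # print the most viewed products of each window to the console.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-top-products").getOrCreate()

    views = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
             .option("subscribe", "product_views")                 # assumed topic name
             .load()
             .selectExpr("CAST(value AS STRING) AS product_id", "timestamp"))

    top_products = (views
                    .groupBy(F.window("timestamp", "5 minutes"), "product_id")
                    .count()
                    .orderBy(F.desc("count")))

    query = (top_products.writeStream
             .outputMode("complete")   # complete mode is required to sort aggregated results
             .format("console")
             .option("numRows", 10)    # show only the top 10 rows per micro-batch
             .start())
    query.awaitTermination()

The query runs continuously, updating the windowed counts each micro-batch instead of waiting for a finite dataset to be fully loaded.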

Conclusion

The research for this assignment did not yield exact answers about what is considered a querying technique for big data; thus, the introduction listed a variety of techniques mentioned in the literature, and the body then explained three techniques that are considered beneficial for organizations. We do not consider these the only techniques for querying big data, but they are common and widely used, and each is paired with a suitable use case.

The querying techniques used for big data are varied and depend on the use case and the organization’s requirements. SQL-like languages like HiveQL are suitable for batch processing and querying large volumes of data. Frameworks’ exposed APIs are suitable for complex queries and for organizations that require more flexibility in querying. Stream and micro-batch processing are suitable for real-time processing and for use cases that require immediate feedback.

References