DA4. Challenges in Querying Big Data

Statement

Discuss three challenges organizations face when querying big data. Provide your justification for the three challenges you selected.

Answer

Introduction

The Vs of big data describe its defining characteristics: volume (the size of the data), velocity (the speed at which the data is generated and processed), variety (the range of data types and sources), veracity (the truthfulness of the data), and value (the business importance of the data).

Each of these characteristics presents challenges to organizations when querying big data. The sections below discuss three of them: managing massive amounts of data (volume), integrating data from various sources (variety), and ensuring the quality of the data (veracity) (Miller, 2022).

Challenge 1: Managing Massive Amounts of Data (Volume)

The volume of big data is usually too large to fit on a single machine. Volume is rarely a problem for storage, since disks are relatively cheap; the challenge lies in querying such large volumes and executing complex queries over them. Depending on the framework used, the data may need to be loaded into the relevant data structures (either in-memory or on-disk), and the query processor must then pass over the data to evaluate the query.

Batch processing frameworks like Hadoop are designed to handle large volumes of data by distributing the data across multiple nodes in a cluster. Size is less of a concern for stream processing frameworks like Apache Storm and Spark Streaming, since they process items individually; for them, the speed at which the data is generated and processed is the key constraint (Gurusamy et al., 2017).
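
As an illustration, the sketch below shows a distributed batch query in PySpark. The input and output paths and the user_id/amount columns are hypothetical; the point is only that the framework splits the data into partitions and evaluates the aggregation in parallel across the cluster.

```python
# A minimal sketch of a distributed batch query, assuming hypothetical
# paths and a dataset with user_id and amount columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("volume-demo").getOrCreate()

# Spark reads the dataset lazily and distributes its partitions
# across the executors of the cluster.
events = spark.read.parquet("s3a://bucket/events/")

# The aggregation runs as partial aggregates on each partition,
# followed by a shuffle that merges the partial results.
totals = (events
          .groupBy("user_id")
          .agg(F.sum("amount").alias("total_amount")))

totals.write.mode("overwrite").parquet("s3a://bucket/totals/")
```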

Challenge 2: Integrating Data from Various Sources (Variety)

Big data becomes big because the organization's business model and customer base are big, which translates into many sources of events and many formats. Flexibility in dealing with these various sources and formats is vital to any big data system, but it also presents a challenge when querying: data stored in different formats must first be transformed into a common format that the query processor can understand and evaluate.

Many big data frameworks therefore provide APIs to read and write data in different formats like Avro, Parquet, ORC, JSON, and CSV. Such formats must be converted into the relevant data structures (DataFrames, Datasets, etc.) before they can be queried, a process that is time-consuming and resource-intensive compared with storing the data in the target format in the first place.
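
As a hedged sketch of this integration step, the PySpark snippet below reads three hypothetical sources (JSON, CSV, and Parquet; the paths and the shared event_id/ts/payload columns are assumptions for illustration) and projects them onto one common column set so that a single query can run over all of them.

```python
# A sketch of format integration in PySpark; paths and the common
# (event_id, ts, payload) schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("variety-demo").getOrCreate()

json_events = spark.read.json("landing/clicks.json")
csv_events = spark.read.option("header", "true").csv("landing/orders.csv")
parquet_events = spark.read.parquet("landing/sensors.parquet")

# Project every source onto the same common columns before querying.
common_cols = [F.col("event_id"), F.col("ts").cast("timestamp"), F.col("payload")]
unified = (json_events.select(*common_cols)
           .unionByName(csv_events.select(*common_cols))
           .unionByName(parquet_events.select(*common_cols)))

# One query now covers all three sources.
unified.createOrReplaceTempView("events")
spark.sql(
    "SELECT date_trunc('day', ts) AS day, count(*) AS n "
    "FROM events GROUP BY 1"
).show()
```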

Challenge 3: Ensuring the Quality of the Data (Veracity)

The veracity of big data refers to its quality. A big data system has no control over its data sources and enforces no constraints on write, so the data must be validated and cleaned before it can be queried. It may contain missing, incorrect, or duplicate values that need to be handled first (Lawton, 2022). This cleansing process consumes time and resources, but it is essential for producing accurate results and reducing failures during query execution.
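
The minimal PySpark pass below illustrates such a cleaning step, assuming a hypothetical events dataset with event_id, user_id, ts, and amount columns; each operation handles one of the problems named above (duplicates, missing values, incorrect values).

```python
# A minimal cleaning sketch; the input path and column names are
# assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("veracity-demo").getOrCreate()
events = spark.read.parquet("landing/events/")

cleaned = (events
           .dropDuplicates(["event_id"])        # remove duplicate events
           .na.drop(subset=["user_id", "ts"])   # drop rows missing required fields
           .na.fill({"amount": 0.0})            # default a missing numeric field
           .filter(F.col("amount") >= 0))       # reject obviously incorrect values

cleaned.write.mode("overwrite").parquet("curated/events/")
```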

Conclusion

Managing big data is a complex process full of challenges; however, big data frameworks have proven resilient in handling them. The volume challenge is addressed by distributing the load and choosing between stream, batch, or micro-batch models. The variety challenge is addressed by supporting common data formats for both read and write operations. The veracity challenge is addressed by enforcing schema on read, ensuring that missing and incorrect values are handled before the data is queried.

References
