JA1. Introduction to Big Data¶
Statement¶
Reflect on the learning from this week around the fundamentals of big data and respond to the following:
- Compare and contrast the three base elements of big data (volume, velocity, and variety).
- What role do you feel data quality plays in the overall importance of big data collection and analysis? How does it impact these three base elements?
Answer¶
Introduction¶
Characteristics of big data, also known as the Vs of big data, are properties that can be used to describe and differentiate sets of big data. The three base elements of big data are volume, velocity, and variety; later on, three more Vs were added: Veracity, Variability, and Value (Tyagi, 2019); but it can take as many Vs as the context of the problem that is being solved requires.
The quality of the data is an important factor in the big data realm; however, Firican (2017) states that value is the most important element of big data. The text will discuss the three base elements of big data: volume, velocity, and variety; and then the role of data quality in big data collection and analysis.
Compare and contrast the three base elements of big data (volume, velocity, and variety)¶
Volume refers to the amount or size of data that is being generated from different sources such as mobile clients or sensors; such data is sent to servers where it is being stored and processed (Taylor, 2022). Volume is represented by size units and it is usually as large as terabytes, petabytes, and exabytes.
Velocity refers to the speed at which data is being generated and sent to the servers (Taylor, 2022). This can describe the limits of clients, servers, and the network itself. Velocity is measured by size in the unit of time, for example, a single client generates 1 MB of data per second, and the server can handle 100 MB of data per second (thus it can handle 100 clients).
Variety refers to the different types of data being sent to servers which can be structured, semi-structured, or unstructured (Taylor, 2022). Big data systems can state how many types of data they can receive, process, or send back. For example, a server may accept both structured and semi-structured JSON data, and the client can specify what type of data it wants to receive. Unstructured data is usually the hardest to process but with the rise of Large Language Models (LLMs) like GPT-3, it may be possible to process it.
What role do you feel data quality plays in the overall importance of big data collection and analysis?¶
Data quality, or Veracity, refers to the trustworthiness, uncertainty, and quality of the data. The data may be incomplete, inconsistent, erroneous, duplicated, or false (Sebastian, 2022). The analysis process aims for high-quality data, but if such data is not available, it should try to increase the quality of the data by cleaning it, removing duplicates, filling in missing data, and other techniques.
High-quality data are usually more reliable and lead to better decision-making, however, it is hard to achieve. High-quality data is genuine, which means that every event sent for analysis is generated based on an authentic event that happened in real life; it is semantically known and useful; it also contains all the necessary information to be processed as missing data will be guessed or ignored, and in both cases, the final result will be less accurate.
Determining the quality of the data is a hard task, especially if you don’t own the entire collection process or event sources. However, after analyzing incoming data and reaching a reference point, systems can tell if data quality has decreased or increased compared to that reference point.
How does quality impact these three base elements?¶
Data quality affects volume as high-quality data is hard to get, its volume is usually less than low-quality data; thus, low-quality data wastes unnecessary space and resources.
Data quality affects velocity as low-quality data requires extra processing and cleaning steps before it can be used or analyzed; this extra step can slow down the processing velocity of the data, but it does not affect sending velocity.
Data quality affects variety as high-quality data is usually similar in shape and semantics (in the context of a specific problem) which means that high variety may indicate low-quality data.
Conclusion¶
Data quality is important; it impacts other elements of big data such as volume, velocity, and variety. High-quality data is usually the ultimate goal for any collection and analysis systems, but it is hard to get. Each of these elements is also important on its own, but value remains the most important of the Vs of big data.
References¶
- Firican, G. (2017). The 10 Vs of Big Data | TDWI. TDWI. https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx
- Sebastian B. (2022, October 25). What Does Big Data Mean? FutureLearn. https://www.futurelearn.com/info/courses/applied-big-data-analytics/0/steps/52404
- Taylor, D. (2022, March 26). What is big data? Introduction, types, characteristics, examples. Retrieved May 16, 2022. Guru99. https://www.guru99.com/what-is-big-data.html
- Tyagi, V. (2019, January 10). 5 V’s of big data. GeeksforGeeks. https://www.geeksforgeeks.org/5-vs-of-big-data/