
WA1. The Vs of Big Data

Statement

Write about the five V’s of big data. Specifically, describe each item and provide examples that represent each item. Discuss the importance of each of these items to the collection and analysis of big data.

Answer

Introduction

Big Data is a term that refers to a large volume of data that is generated by a variety of sources and types at a high velocity to the point that it is complex for traditional data management tools to handle it (Taylor, 2022).

The term is also associated with three Vs: Volume, Velocity, and Variety; later, three more Vs were added: Veracity, Variability, and Value (Tyagi, 2019). These Vs are also known as the characteristics of big data.

Volume

Volume refers to the amount or size of data being generated from different sources, such as mobile clients, sensors, and social media, and then streamed to analytics systems, stations, or platforms for processing and analysis (Taylor, 2022).

Examples of Volume include the 300 hours of video uploaded to YouTube every minute, the 1.1 trillion photos taken in 2016, and the 6.2 billion gigabytes of monthly mobile data traffic in 2016 (Firican, 2017). Another example is Walmart dealing with more than 1 million customer transactions every hour which amounts to 2.5 petabytes of data that need to be stored every hour (Sebastian, 2022).

Volume is important because it is what distinguishes big data from ordinary data: if the total data streamed to a platform (server, database, etc.) is large enough to exceed that platform's capacity (CPU, memory, storage) to handle it, then it is considered big data, assuming the platform has a reasonable capacity to begin with. The platform must respond to the data volume with a distributed file system, a scalable database, or some other distributed computing solution.
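To make that scaling pressure concrete, here is a rough back-of-envelope sketch in Python, using made-up numbers rather than figures from the sources cited above; the 50 KB average record size and 64 TB per-node capacity are assumptions chosen only for the arithmetic.

```python
import math

def nodes_needed(records_per_hour: int, avg_record_kb: float,
                 retention_days: int, node_capacity_tb: float) -> int:
    """Back-of-envelope estimate of how many storage nodes are needed
    to hold the data retained over a given period."""
    kb_per_day = records_per_hour * 24 * avg_record_kb
    total_tb = kb_per_day * retention_days / 1024 ** 3  # KB -> TB
    return max(1, math.ceil(total_tb / node_capacity_tb))

# Hypothetical workload: 1 million records/hour, ~50 KB each, kept for 2 years.
print(nodes_needed(records_per_hour=1_000_000, avg_record_kb=50,
                   retention_days=730, node_capacity_tb=64))  # -> 13 nodes
```

Even with these modest assumptions, a single machine is not enough, which is why volume forces the move to distributed storage and computation.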

Velocity

Velocity refers to the speed at which data is being generated and streamed to the analytics platforms. The data is generated at a high speed and must be processed and analyzed in real-time or near real-time (Taylor, 2022).

Examples of Velocity include Radio-Frequency Identification (RFID), Global Positioning System (GPS), Near-Field Communication (NFC), and Bluetooth devices and sensors that generate data at high speed or at frequent intervals and send it to a system or platform for processing. The system then detects specific patterns and triggers certain actions based on the data received (Sebastian, 2022).
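As a minimal illustration of that pattern-then-action loop, the sketch below assumes a hypothetical stream of vehicle speed readings (rather than any specific RFID or NFC payload) and triggers an action as soon as several consecutive low-speed readings arrive; the window size and threshold are arbitrary.

```python
from collections import deque
from typing import Iterable

def monitor_speeds(readings: Iterable[float], window: int = 5,
                   threshold_kmh: float = 10.0) -> None:
    """Consume speed readings as they arrive and raise a congestion
    alert once `window` consecutive readings fall below the threshold."""
    recent = deque(maxlen=window)
    for speed in readings:                      # in practice, a live stream
        recent.append(speed)
        if len(recent) == window and all(s < threshold_kmh for s in recent):
            print("Congestion detected - trigger signal-timing adjustment")
            recent.clear()                      # avoid repeated alerts

# Hypothetical burst of readings from one road sensor (km/h).
monitor_speeds([42, 38, 9, 8, 7, 6, 5, 30, 35])
```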

Velocity is important because it allows organizations to make decisions based on real-time data, and it dictates the infrastructure needed to handle the speed at which data is generated and processed. For example, a system that analyzes the traffic in a city must adapt to the velocity of both rush hours and early mornings when the traffic is low.

Variety

Variety refers to the different types of data being generated, or the different sources that send data to the same system. Data can be structured, semi-structured, or unstructured (Taylor, 2022).

Examples of Variety can be seen in a system that analyzes user interest in a specific product on Amazon: structured data may come from event listeners on the web page that track user clicks or scrolls; semi-structured data may come from third-party APIs that display the product on their storefronts; and unstructured data may come from the reviews on that product, where users can write natural language text or upload images.
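The sketch below shows what those three shapes of data might look like side by side and how they could be normalized into one common record; the field names, the `to_interest_record` helper, and the word-counting "sentiment" proxy are made up for illustration and are not an Amazon or third-party API.

```python
import json
from datetime import datetime, timezone

# Structured: a click event captured by a page event listener (fixed schema).
click_event = ("user-123", "prod-42", "click", "2024-05-01T10:15:00Z")

# Semi-structured: JSON from a hypothetical third-party storefront API.
api_payload = json.loads('{"productId": "prod-42", "views": 17, "partner": "storefront-A"}')

# Unstructured: free-text review written by a user.
review_text = "Bought this last week, battery life is great but it feels heavy."

def to_interest_record(product_id: str, source: str, signal: float) -> dict:
    """Normalize any source into one common 'interest' record."""
    return {"product_id": product_id, "source": source, "signal": signal,
            "ingested_at": datetime.now(timezone.utc).isoformat()}

records = [
    to_interest_record(click_event[1], "web_click", 1.0),
    to_interest_record(api_payload["productId"], "partner_api", api_payload["views"]),
    # A naive proxy for the unstructured text: count a few positive words.
    to_interest_record("prod-42", "review_text",
                       sum(w in review_text.lower() for w in ("great", "love", "good"))),
]
print(records)
```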

Variety is important because it allows organizations to analyze data from different sources and types to gain a better understanding of the problem being solved. Data with many sources is generally preferred since it may reveal more insights and patterns that can be used to make better decisions. On the other hand, data with fewer sources is preferred if each source generates data with a completely different shape and semantics, since that is easier to develop and maintain.

Veracity

Veracity refers to the trustworthiness, uncertainty, and quality of the data. The data may be incomplete, inconsistent, erroneous, duplicated, or false; the processing system needs to be aware of that and try to clean the data before processing it (Sebastian, 2022).

Examples of Veracity include two or more identical events sent from a sensor at almost the same instant: the first event may be assigned high veracity while the later ones may be assigned low veracity, assuming that a fault in the sensor caused the duplication. Another example is that product reviews from users who are verified to have purchased the product should have higher veracity than reviews whose purchase cannot be verified.
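A minimal sketch of both examples follows, assuming arbitrary values for the deduplication window (2 seconds) and the veracity weights; the `score_sensor_events` and `review_veracity` helpers are hypothetical, not part of any particular analytics platform.

```python
def score_sensor_events(events: list[dict], dedup_window_s: float = 2.0) -> list[dict]:
    """Assign a veracity weight to sensor events: the first occurrence of a
    reading keeps full weight, while a near-identical repeat within the window
    is treated as a likely sensor fault and down-weighted."""
    last_seen: dict[tuple, float] = {}
    scored = []
    for e in sorted(events, key=lambda e: e["ts"]):
        key = (e["sensor_id"], e["value"])
        prev_ts = last_seen.get(key)
        weight = 0.3 if prev_ts is not None and e["ts"] - prev_ts <= dedup_window_s else 1.0
        last_seen[key] = e["ts"]
        scored.append({**e, "veracity": weight})
    return scored

def review_veracity(verified_purchase: bool) -> float:
    """Verified purchases are trusted more than unverified ones."""
    return 1.0 if verified_purchase else 0.5

events = [{"sensor_id": "s1", "value": 7, "ts": 100.0},
          {"sensor_id": "s1", "value": 7, "ts": 100.4},   # likely duplicate
          {"sensor_id": "s1", "value": 7, "ts": 160.0}]   # independent reading
print(score_sensor_events(events))
print(review_veracity(True), review_veracity(False))
```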

Veracity is important because it adds confidence to decisions made from data with high trustworthiness. Systems must be able to detect untrusted or duplicated data and weigh it less in the analysis process.

Value

Value refers to the usefulness of the data to the organization or the problem that is being solved. Collecting and analyzing data that is irrelevant is a waste of resources and time (Firican, 2017).

An example of Value is a system that analyzes the traffic in a city; the data is collected from relevant sources such as traffic lights and cameras, moving vehicles, GPS systems, satellite images, and real-time public transportation information. The collected data is sent to servers that figure out the traffic patterns (busy vs. empty streets) and update the map servers. The end user then queries the map server to get the best route to their destination, with traffic information projected on the map.
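As a toy sketch of the aggregation step in such a pipeline, the code below groups hypothetical GPS speed readings by road segment and classifies each segment as busy or clear, which is the kind of signal a map server could project; the segment names and the 15 km/h cutoff are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def segment_status(readings: list[tuple[str, float]], busy_below_kmh: float = 15.0) -> dict:
    """Aggregate GPS speed readings per road segment and classify each
    segment as 'busy' or 'clear' - the signal a map server would display."""
    by_segment = defaultdict(list)
    for segment_id, speed in readings:
        by_segment[segment_id].append(speed)
    return {seg: ("busy" if mean(speeds) < busy_below_kmh else "clear")
            for seg, speeds in by_segment.items()}

# Hypothetical readings: (road segment, observed vehicle speed in km/h).
readings = [("main-st", 8), ("main-st", 12), ("main-st", 10),
            ("ring-rd", 55), ("ring-rd", 62)]
print(segment_status(readings))   # {'main-st': 'busy', 'ring-rd': 'clear'}
```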

Value is the most important of the Vs because it is the reason organizations collect data in the first place. If the collected data is irrelevant, then storing and analyzing it will not help, but rather make the computed decisions less accurate. The value of specific data is determined by the problem being solved and the insights that can be extracted from the data.

Conclusion

This text has discussed five of the Vs of Big Data: Volume, Velocity, Variety, Veracity, and Value. However, researchers have identified and introduced more Vs according to the context of the data and the problem being solved; for example, Firican (2017) identified 10 Vs of Big Data by adding five more Vs: Variability, Validity, Vulnerability, Volatility, and Visualization to the Vs mentioned above. Value is the most important of the Vs because it is the reason organizations collect data in the first place. The other Vs are also important, but their importance varies according to the context of the data and the problem being solved.

References