1. Introduction to Big Data¶
What is Big Data? 1¶
- Big data is data that contains greater variety, arrives in increasing volumes, and moves with more velocity. These characteristics are known as the three “Vs.”
- Volume: The amount of data. Big data typically involves high volumes of low-density, unstructured data.
- Velocity: The rate at which data is received and processed. The highest-velocity data is typically streamed directly into memory rather than written to disk.
- Variety: The number of different types of data being processed. Data can be structured, semi-structured, or unstructured. Structured data is usually easier to work with; unstructured and semi-structured data require extra processing to be usable.
- Later, two more Vs were added to the definition of big data:
- Veracity: The quality of the data being captured can vary greatly. It is important to ensure the data is accurate. How truthful is the data?
- Value: Data has intrinsic value, but it is only useful to the organization once that value is discovered and acted on.
History of Big Data¶
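- The origins of large data sets go back to the 1960s and 1970s, with the first data centers and the development of the relational database.
- Around 2005: Hadoop was created, and NoSQL databases began to gain popularity.
- 2014: Apache Spark became a top-level Apache project and reached version 1.0 (it began as a research project at UC Berkeley in 2009).
- IoT devices now generate data just as internet users do.
- Machine learning and AI models also generate data.
- Graph databases are increasingly used to store and analyze data.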
Big Data Use Cases¶
- Product Development:
- Companies like Netflix and Procter & Gamble use big data to anticipate customer demand.
- They build predictive models for new products and services by classifying key attributes of past and current products or services and modeling the relationship between those attributes and the commercial success of the offerings.
- They use big data to plan, produce, and launch new products.
- Predictive Maintenance:
- Factors that can predict mechanical failures may be deeply buried in structured data, such as the year, make, and model of equipment, as well as in unstructured data that covers millions of log entries, sensor data, error messages, and engine temperature.
- By analyzing these indications of potential issues before the problems happen, organizations can deploy maintenance more cost-effectively and maximize parts and equipment uptime.
- Customer Experience:
- Big data enables you to gather data from social media, web visits, call logs, and other sources to improve the interaction experience and maximize the value delivered.
- This makes it possible to deliver personalized offers, reduce customer churn, and handle issues proactively.
- Fraud and Compliance:
- Big data helps you identify patterns in data that indicate fraud and aggregate large volumes of information to make regulatory reporting much faster.
- Machine Learning:
- We are now able to teach machines instead of programming them.
- The availability of big data to train machine learning models makes that possible.
- Operational efficiency:
- With big data, you can analyze and assess production, customer feedback, returns, and other factors to reduce outages and anticipate future demands.
- Big data can also be used to improve decision-making in line with current market demand.
- Drive innovation:
- Big data can help you innovate by studying interdependencies among humans, institutions, entities, and processes and then determining new ways to use those insights.
- Use data insights to improve decisions about financial and planning considerations.
- Examine trends and what customers want to deliver new products and services.
- Implement dynamic pricing. There are endless possibilities.
How Big Data Works¶
- Integrate:
- Bring data from various disparate sources together.
- Traditional data integration mechanisms such as ETL (extract, transform, load) are often not up to the task at terabyte or petabyte scale, so new strategies and technologies are needed.
- During integration, you need to bring in the data, process it, and make sure it’s formatted and available in a form that your business analysts can get started with.
- Manage:
- Big data requires storage. Your storage solution can be in the cloud, on-premises, or both.
- You can store your data in any form you want and bring the processing requirements and process engines you need to those data sets on demand.
- Analyze:
- Your investment in big data pays off when you analyze and act on your data. Get new clarity with a visual analysis of your varied data sets.
- Explore the data further to make new discoveries.
- Share your findings with others.
- Build data models with machine learning and artificial intelligence.
- Put your data to work.
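
The three stages above (integrate, manage, analyze) can be illustrated with a very small in-memory sketch. Everything here is an assumption made for illustration: the sample data, the function names `integrate`, `manage`, and `analyze`, and the use of a dictionary as a stand-in for real storage. A real pipeline would read from external systems and write to a cloud or on-premises store.

```python
import csv
import io
import json

# Hypothetical raw inputs from two disparate sources (made-up sample data).
ORDERS_CSV = """order_id,customer,amount
1,alice,20.50
2,bob,5.00
3,alice,12.25
"""
EVENTS_JSON = '[{"customer": "alice", "event": "return"}, {"customer": "bob", "event": "review"}]'


def integrate(orders_csv: str, events_json: str) -> list[dict]:
    """Integrate: parse both sources and normalize them into one record format."""
    records = []
    for row in csv.DictReader(io.StringIO(orders_csv)):
        records.append({"customer": row["customer"], "kind": "order", "amount": float(row["amount"])})
    for event in json.loads(events_json):
        records.append({"customer": event["customer"], "kind": event["event"], "amount": 0.0})
    return records


def manage(records: list[dict]) -> dict[str, list[dict]]:
    """Manage: keep records grouped by customer (a dict stands in for real storage)."""
    store: dict[str, list[dict]] = {}
    for record in records:
        store.setdefault(record["customer"], []).append(record)
    return store


def analyze(store: dict[str, list[dict]]) -> dict[str, float]:
    """Analyze: compute total spend per customer."""
    return {customer: sum(r["amount"] for r in recs) for customer, recs in store.items()}


store = manage(integrate(ORDERS_CSV, EVENTS_JSON))
spend = analyze(store)
print(spend)  # {'alice': 32.75, 'bob': 5.0}
```

The point of the sketch is the separation of concerns: sources are normalized once during integration, so the storage and analysis steps never need to know where a record came from.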
Big Data Best Practices¶
- Align big data with specific business goals.
- Ease skills shortages with standards and governance.
- Optimize knowledge transfer with a center of excellence.
- Align unstructured data with structured data.
- Plan your discovery lab for performance and scale.
- Align with the cloud operating model.
Where does ‘Big Data’ come from? 2¶
- The term ‘Big Data’ has been in use since the early 1990s.
- Most people credit John R. Mashey (who at the time worked at Silicon Graphics) for making the term popular.
- The total amount of data in the world was 4.4 zettabytes in 2013. At the time, it was projected to rise steeply to 44 zettabytes (44 trillion gigabytes) by 2020.
- The table below shows the phases of Big Data evolution:
| Phase 1: 1970-2000 | Phase 2: 2000-2010 | Phase 3: 2010-Present |
|---|---|---|
| DBMS-based structured content | Web-based unstructured content | Mobile, sensor-based IoT content |
| RDBMS and data warehousing | Information retrieval & extraction | Location-aware analysis |
| ETL (Extract, Transform, Load) | Opinion mining | Person-centric analysis |
| OLAP (Online Analytical Processing) | Question answering | Context-relevant analysis |
| Dashboards and Scorecards | Web analytics & Web intelligence | Mobile visualization |
| Data mining and statistical analysis | Social media analytics | Human-computer interaction |
| | Social network analysis | |
| | Spatial-temporal analysis | |
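
The unit conversion quoted in this section (44 zettabytes as 44 trillion gigabytes) can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units; the function name `zettabytes_to_gigabytes` is made up for illustration.

```python
# Decimal (SI) units: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes.
GIGABYTE = 10**9
ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * (ZETTABYTE // GIGABYTE)


gigabytes_2020 = zettabytes_to_gigabytes(44)
print(f"{gigabytes_2020:,} GB")  # 44,000,000,000,000 GB, i.e. 44 trillion
```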
Characteristics of Big Data 3¶
- Data is the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
- Big Data is a collection of data that is huge in volume and still growing exponentially over time. Its size and complexity are so great that traditional data management tools cannot store or process it efficiently.
Examples of big data¶
- The New York Stock Exchange generates about one terabyte of new trade data per day.
- More than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day. This data mainly comes from photo and video uploads, message exchanges, comments, etc.
- A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, the data generated reaches many petabytes.
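
To see how the jet engine figure above scales to "many petabytes", here is a rough back-of-the-envelope sketch. The per-engine rate and the Facebook figure come from the examples above; the number of flights, the flight duration, and the engine count are hypothetical values chosen only to make the arithmetic concrete.

```python
TB_PER_PB = 1000  # decimal units: 1 PB = 1000 TB

# Figures quoted in the examples above.
FACEBOOK_TB_PER_DAY = 500
JET_ENGINE_TB_PER_30_MIN = 10


def fleet_petabytes_per_day(flights: int, hours_per_flight: float, engines_per_plane: int = 2) -> float:
    """Estimate daily engine data for a fleet, in petabytes."""
    tb_per_engine_hour = JET_ENGINE_TB_PER_30_MIN * 2  # two 30-minute periods per hour
    total_tb = flights * hours_per_flight * engines_per_plane * tb_per_engine_hour
    return total_tb / TB_PER_PB


# Hypothetical fleet: 5,000 flights per day, 2 hours each, twin-engine aircraft.
fleet_pb = fleet_petabytes_per_day(5_000, 2.0)
print(fleet_pb)  # 400.0 petabytes per day under these assumptions

# Facebook's quoted daily ingest, accumulated over a year.
facebook_pb_per_year = FACEBOOK_TB_PER_DAY * 365 / TB_PER_PB
print(facebook_pb_per_year)  # 182.5 petabytes per year
```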
Types of Big Data¶
- Structured:
- Data that can be stored, accessed, and processed in fixed formats.
- Examples include relational databases and spreadsheets.
- Unstructured:
- Data that has an unknown form or structure.
- Examples include text files, multimedia content, etc.
- Unstructured data sets are usually larger than structured ones.
- Other examples, often produced by web applications, include log files and transaction histories.
- Semi-structured:
- Data that has the characteristics of both structured and unstructured data, that is, has a predefined format but does not fit into a relational database.
- Examples include XML, JSON, etc.
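
The difference between the three types shows up in how much work it takes to get a usable value out of the data. The sketch below uses made-up sample data and only the Python standard library: a CSV table stands in for structured data, a JSON document for semi-structured data, and a free-text note for unstructured data.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row has the same fields.
structured = "id,name,city\n1,Ada,London\n2,Linus,Helsinki\n"
rows = list(csv.DictReader(io.StringIO(structured)))
cities = [row["city"] for row in rows]

# Semi-structured: self-describing keys, but nested and variable in shape.
semi_structured = '{"id": 3, "name": "Grace", "tags": ["navy", "compiler"], "address": {"city": "Arlington"}}'
doc = json.loads(semi_structured)
nested_city = doc["address"]["city"]

# Unstructured: no fields at all, so extra processing (here, a keyword match) is needed.
unstructured = "Customer called about a late delivery; engine warning light was on again."
keywords = {"delivery", "engine", "refund"}
words = set(re.findall(r"[a-z]+", unstructured.lower()))
matches = sorted(keywords & words)

print(cities)       # ['London', 'Helsinki']
print(nested_city)  # Arlington
print(matches)      # ['delivery', 'engine']
```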
The Origins of Big Data 4¶
- Big data arguably came of age in 2013, when the Oxford English Dictionary added the term “Big Data” for the first time.
- Hadoop's design is based on Google's MapReduce and Google File System papers.
- By 2008, Google was processing about 20 petabytes of data a day.
Big data in context: Legal, social and technological insights 5¶
- Read the first chapter, “Big Data and Data Quality” (pages 1-12), which introduces how big data and data quality are linked and why quality data matters for in-depth analysis.
A Study on the Challenges and Types of Big Data 6¶
- The number of web pages indexed by Google was around one million in 1998, rapidly reached 1 billion in 2000, and surpassed 1 trillion in 2008.
- CDR (call detail record): A CDR is a data record produced by a telephone exchange or other telecommunications equipment that documents the details of a telephone call or other telecommunications transaction (e.g., text message) that passes through that facility or device.
- Analysis of unstructured data relies on keywords, which enable users to filter the data based on searchable terms.
- Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data.
- Being autonomous, each data source is able to generate and collect information without involving (or relying on) any centralized control.
- Challenges of big data:
- Privacy, security, and trust.
- Data management and sharing.
- Technology and analytical systems.
The 5 V’s of Big Data 7¶
- Volume: The amount of data generated, which for big data is very large.
- Velocity: The speed at which data is generated.
- Variety: The different types of data generated (structured, unstructured, semi-structured) or how heterogeneous the data sources are.
- Veracity: The quality, inconsistency, uncertainty, and trustworthiness of the data.
- Value: The value of the data generated to the organization.
- Variability: Sometimes added as a sixth V. The speed at which the shape of the data is changing.
Video Resources 8 9¶
- Big data is used in healthcare, finance, retail, and many other industries 8.
- Big data challenges:
- Price.
- Storage Capacity.
- Data Security.
- Availability.
- Quality Assurance.
- Compliance Issues.
- Skills Shortage.
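- The video states that a single smartphone user generates about 40 exabytes of data every month 9. Treat this figure with caution: it is hard to reconcile with the estimate of 44 zettabytes (44,000 exabytes) for all the data in the world in 2020.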
- Frameworks for handling big data:
- Cassandra.
- Hadoop:
- It uses a distributed file system (HDFS) that stores data across multiple machines.
- MapReduce is used to process data where data is divided into smaller parts and processed in parallel, and the results are combined.
- Spark.
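
The MapReduce idea described above (split the data, process the parts in parallel, combine the results) can be sketched as a word count in plain Python. This is a single-machine illustration, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up, and a thread pool stands in for the cluster of worker machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.lower().split()]


def shuffle(mapped: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key across every chunk."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: combine the values for each key into a single result."""
    return {word: sum(counts) for word, counts in groups.items()}


# The input is divided into smaller parts (one string per "machine").
chunks = ["big data is big", "data is everywhere", "big ideas"]

# Each chunk is mapped independently, so the work can run in parallel.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_phase, chunks))

counts = reduce_phase(shuffle(mapped))
print(counts["big"], counts["data"], counts["is"])  # 3 2 2
```

Because each call to `map_phase` only looks at its own chunk, the map step needs no coordination between workers; all cross-chunk work is deferred to the shuffle and reduce steps.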
References¶
1. What is big data? (2022). OCI. https://www.oracle.com/big-data/what-is-big-data/
2. Where does ‘big data’ come from? (2019, March 26). Enterprise Big Data Framework. https://www.bigdataframework.org/short-history-of-big-data/
3. Taylor, D. (2022, March 26). What is big data? Introduction, types, characteristics, examples. Guru99. Retrieved May 16, 2022, from https://www.guru99.com/what-is-big-data.html
4. Dontha, R. (2017). The Origins of Big Data. KDnuggets. https://www.kdnuggets.com/2017/02/origins-big-data.html
5. Hoeren, T., Kolany-Raiser, B. (2017). Big data in context: Legal, social and technological insights. Springer Nature. DOI: 10.1007/978-3-319-62461-7, licensed under CC BY 4.0. https://library.oapen.org/handle/20.500.12657/27850
6. Mannava, P. (2013). A Study on the challenges and types of big data. International Journal of Innovative Research in Science, Engineering and Technology, 2(8). Retrieved from https://www.researchgate.net/publication/342003973_A_Study_on_the_Challenges_and_Types_of_Big_Data
7. Tyagi, V. (2019, January 10). 5 V’s of big data. GeeksforGeeks. https://www.geeksforgeeks.org/5-vs-of-big-data/
8. Eye on Tech. (2020, February 25). What is big data and what is it used for? [Video]. YouTube. https://www.youtube.com/watch?v=jH44SfUNpWw
9. SimpliLearn. (2019, December 10). Big data in 5 minutes | What is big data | Introduction to big data | Big data explained | SimpliLearn [Video]. YouTube. https://www.youtube.com/watch?v=bAyrObl7TYE