3. Analytical Theories and Methods¶

Cluster Analysis ¹¶

Clustering is the practice of grouping data points so that points in the same group are more similar to each other than to points in other groups, which supports data analysis and reporting.
It is one of the main tasks used in the process of statistical analysis, pattern recognition, data compression, and computer graphics.
Unsupervised learning is a type of machine learning that searches for patterns in a data set with no pre-existing labels and a minimum of human intervention.
Some of the most important approaches or methods used in data analysis include regression analysis, simple linear regression analysis, hypothesis analysis, null hypothesis, content analysis, discourse analysis, grounded theory, and cross-tabulation.
There are many clustering algorithms, simply because there are many notions of what a cluster should be or how it should be defined.
There are more than 100 clustering algorithms that have been published to date.
Clusterings or sets of clusters are often distinguished as:
- Hard clustering: each data point either belongs to a cluster completely or does not belong to it at all.
- Soft clustering: Each data point belongs to a cluster to some degree.
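The difference between the two can be sketched with a few lines of Python. This is only an illustration: the cluster centers and the point are made-up values, and the soft memberships use the fuzzy c-means membership formula with fuzzifier `m = 2`, which is one common choice rather than the only way to define degrees of membership.

```python
import math

def distances(point, centers):
    """Euclidean distance from one point to each cluster center."""
    return [math.dist(point, c) for c in centers]

def hard_assignment(point, centers):
    """Hard clustering: the index of the single nearest center."""
    d = distances(point, centers)
    return d.index(min(d))

def soft_memberships(point, centers, m=2.0):
    """Soft clustering: fuzzy c-means style membership degrees that sum to 1."""
    d = distances(point, centers)
    # A point sitting exactly on a center belongs to that center completely.
    if 0.0 in d:
        return [1.0 if x == 0.0 else 0.0 for x in d]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** power for dj in d) for di in d]

centers = [(0.0, 0.0), (4.0, 0.0)]   # illustrative centers (assumption)
point = (1.0, 0.0)                   # illustrative data point (assumption)

hard = hard_assignment(point, centers)
soft = soft_memberships(point, centers)
print(hard)   # index of the one cluster the point belongs to
print(soft)   # one membership degree per cluster
```

With these values the point is three times closer to the first center, so the hard assignment picks cluster 0 while the soft memberships give it most, but not all, of the weight.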
Clustering analysis methods include:
- K-means: finds clusters by minimizing the distance between data points and the centroid of their assigned cluster (typically the sum of squared distances).
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
- Spectral Clustering: similarity graph-based algorithm that models the nearest neighbor relationships between data points as undirected edges in a graph.
- Hierarchical Clustering: groups data points into a tree of nested clusters, starting with each data point as a separate cluster and merging them into larger clusters.
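As a concrete example of one method from the list, the following is a minimal sketch of k-means using Lloyd's algorithm (alternating an assignment step and a centroid update step). The function name `kmeans`, its `init` parameter, and the sample points are assumptions made for illustration; production code would normally use a library implementation.

```python
import math
import random

def kmeans(points, k, init=None, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    # Use the given initial centroids, or pick k distinct points at random.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels

# Two visually obvious groups (made-up data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2, init=[points[0], points[3]])
print(labels)
print(centroids)
```

The initial centroids are passed explicitly here so the run is reproducible; with random initialization the result can differ between runs.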

Data Mining - Cluster Analysis ²¶

The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

An Overview of Big Data Analysis ³¶

In the industrial sector, more and more sensors are being incorporated into intelligent products, production equipment, and production monitoring.
Smart production, which has become a vital component of production in the Industry 4.0 era, is facing several challenges mainly related to the following features:
- Heterogeneous data: sensors come from different manufacturers and use different data formats and parameters, so allowing interactions between all these sensors is a challenge.
- Multi-source data: In addition to streaming data in real-time, there are also legacy applications that manage existing devices in companies and that continue to use legacy software despite being dated or no longer supported.
- Large data volume: The data generated by sensors is very large and requires a large amount of storage space.
Use cases of big data analysis include:
- Conducting targeted marketing and customer loyalty campaigns based on customer behavior collected from social networks.
- Detection of production anomalies in real-time.
- Prediction of the energy consumption of a production plant or a technological plant in green smart cities.
- Risk management in production processes or driverless cars.

Research and Application of Clustering Algorithm for Text Big Data ⁴¶

In the era of big data, text is an important store of information in all walks of life.
Discovering knowledge from text data in a high-speed and accurate manner is a major challenge in large-text data mining.
Text data generated in humanities research, the financial industry, marketing, and other fields often has obvious domain characteristics: it contains the professional vocabulary and unique language patterns of those fields and is often accompanied by a variety of noise.
Algorithms used in text big data:
- K-means clustering:
  - It is simple, widely used, and has low time complexity.
  - It is hard to determine the right number K and the initial clustering center.
  - Its efficiency and accuracy can be low, and its results are sensitive to the initial clustering center.
- Drift clustering.
- Mean shift clustering.
- Random walk clustering.
- Fuzzy c-means clustering model (FCM).
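Before any of the algorithms above can be applied, text has to be turned into numbers. The sketch below shows one simple way to do that: it counts the terms in each document and compares documents with cosine similarity. This bag-of-words representation and the sample sentences are assumptions chosen for illustration and are not taken from the cited paper.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [  # made-up documents: two about finance, one about sport
    "The bank raised interest rates",
    "Interest rates at the bank went up",
    "The team won the football match",
]
vectors = [term_counts(d) for d in docs]
finance_pair = cosine_similarity(vectors[0], vectors[1])
mixed_pair = cosine_similarity(vectors[0], vectors[2])
print(finance_pair, mixed_pair)
```

The two finance sentences share most of their vocabulary and therefore score higher than the finance and sport pair, which is the kind of similarity signal a clustering algorithm can then group on.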

Text Big Data and Cluster Analysis¶

Text big data refers to document data that:
- Is manifested in the form of text documents.
- Contains large amounts of information.
- Has high velocity and variety.
- Has low data value.
Data mining is the process of using algorithms to search a large amount of data, such as a database, for hidden and potentially valuable information.
The most common way of processing text big data is cluster analysis, which:
- Is a quantitative analysis method.
- Can be viewed from two perspectives:
  - Data analysis perspective: purely statistical analysis on multiple samples.
  - Data mining perspective: clustering analysis based on density, hierarchy, and partitioning.

What is Data Analysis and Why is it Important? ⁵¶

Data analysis tools:
- Programming languages like R or Python.
- Microsoft Excel.
Data Visualization tools:
- Tableau.
- Power BI.
- Google Data Studio.
Data analysis methods:
- Data mining.
- Text analytics.
- Business intelligence.
Data analysis process:
- Define objectives.
- Define the questions that your data analysis will answer.
- Data collection.
- Data scrubbing: cleaning and organizing data and removing noise and errors so that analysis tools can import and analyze the data.
- Data analysis.
- Drawing conclusions and making predictions.
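The scrubbing and analysis steps of this process can be sketched in a few lines. The records, field names, and helper functions below are hypothetical; the point is only to show noisy input being cleaned before a simple analysis is run on it.

```python
from statistics import mean

raw_rows = [  # hypothetical records as they might arrive from data collection
    {"region": " North ", "sales": "120"},
    {"region": "south", "sales": "95"},
    {"region": "North", "sales": ""},       # missing value
    {"region": "south", "sales": "n/a"},    # not a number
    {"region": "SOUTH", "sales": "105"},
]

def scrub(rows):
    """Normalize text fields and drop rows whose sales value is unusable."""
    clean = []
    for row in rows:
        try:
            sales = float(row["sales"])
        except ValueError:
            continue  # skip noise and missing values
        clean.append({"region": row["region"].strip().lower(), "sales": sales})
    return clean

def average_sales_by_region(rows):
    """Analysis step: mean sales per region."""
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(row["sales"])
    return {region: mean(values) for region, values in by_region.items()}

clean_rows = scrub(raw_rows)
summary = average_sales_by_region(clean_rows)
print(summary)
```

Without the scrubbing step, the inconsistent spellings of the region names would be counted as separate regions and the non-numeric values would stop the analysis.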
Data mining:
- Data mining is a method of data analysis for discovering patterns in large data sets using statistics, artificial intelligence, and machine learning.
- The goal is to turn data into business decisions.
Text analytics:
- Text analytics is the process of finding useful information from text. You do this by processing raw text, making it readable by data analysis tools, and finding results and patterns.
- This is also known as text mining.
- Excel does a great job with this.
Business intelligence:
- Business intelligence transforms data into intelligence used to make business decisions.
Data visualization:
- Data visualization is the visual representation of data. Instead of presenting data in tables or databases, you present it in charts and graphs. It makes complex data more understandable, not to mention easier to look at.

Massively Parallel ⁶¶

Massively Parallel Program:
- It is a program that can run multiple threads that do not interact with each other at all.
- This minimizes the overhead and leads to ideal scaling.
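The idea can be illustrated with a small Python sketch in which each task works on its own input and never communicates with the others. This is not Parallel Java 2 code; it uses Python's standard `concurrent.futures` module, and the prime-counting workload is an arbitrary example.

```python
from concurrent.futures import ThreadPoolExecutor

def count_primes_below(n):
    """An independent unit of work: no shared state and no communication."""
    count = 0
    for candidate in range(2, n):
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return count

limits = [10, 100, 1000]  # one independent task per limit

# Each task runs in its own thread; results come back in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(count_primes_below, limits))

print(dict(zip(limits, results)))
```

In CPython, threads do not speed up CPU-bound work like this because of the global interpreter lock, so the sketch shows the structure of the program rather than ideal scaling; swapping in `ProcessPoolExecutor` is the usual way to use multiple cores.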
When you run a Parallel Java 2 task on a multicore node:
- Only one process is involved (the pj2 process).
- The task’s parallel team threads all run within that same process.
But when you run a Parallel Java 2 job on a cluster:
- Many processes running on multiple nodes have to get involved under the hood.
- These processes constitute the Parallel Java 2 middleware.
- The processes communicate with each other over the cluster’s backend network using TCP sockets.
- There is a Tracker process that keeps track of all the other processes.
- Each process is a Job that stays in contact with the Tracker.
- When resources are available and everything is good to go, Tracker asks the Job to start a Task(s) Process.
- Each Task process runs a Task, Backend, and Launcher.

The Importance of Data Analysis ⁷¶

Data is what you need to do analytics. Information is what you need to do business.
Companies worldwide use data to:
- Boost process and cost efficiency (60%).
- Drive strategy and change (57%).
- Monitor and improve financial performance (52%).
Methods of data analysis:
- Descriptive analysis: answers the question “What happened?”. It looks at historical data and points out trends.
- Diagnostic analysis (exploratory): It looks through historical data to explain an issue or a challenge.
- Predictive analysis: It uses historical data to predict future events.
- Prescriptive analysis:
  - It uses historical data to recommend actions to take in the future.
  - It is a combination of predictive and descriptive analysis.
Analysis tools draw on hundreds of statistical techniques, such as Pareto (80-20) analysis, cohort analysis, regression, text analysis, Porter analysis, and fraud detection methods, to present data in the format needed for the business decision at hand.

What is Data Analysis? Importance, Types, Process & Methods ⁸¶

Types of Data Analysis Methods:
- Descriptive analysis:
  - It answers the question “What happened?”.
  - It is important for identifying and keeping track of key performance indicators (KPIs).
  - Its business applications include monthly profit/income, sales reports, and the KPIs dashboard.
- Diagnostic analysis:
  - It answers the question “Why did it happen?”.
  - It covers the descriptive analysis and goes deeper to find out more details.
  - It is used to find out more information to connect the dots, develop patterns, and create trends.
  - It is a critical step in gathering well-detailed information to ensure there is enough data to investigate problems that are yet to come.
  - Example: a cargo company investigating its slow delivery times in a specific region.
- Predictive analysis:
  - It answers the question “What is likely to happen?”.
  - It is used to analyze past data to predict future outcomes.
  - It uses both descriptive and diagnostic analysis to predict future trends.
  - It depends on the statistical figures and numbers; therefore, it needs technology and resources to complete the forecast.
  - Since the forecast of predictive analysis is an estimate, the accuracy of the final prediction depends on the quality and relevancy of the data.
  - It is not as widely used as the previous two methods because:
    - It is difficult to do predictive analysis without a large amount of data.
    - It requires skills and resources that are not available to every business.
  - Its business applications include:
    - Forecasting sales.
    - Risk assessment.
    - Identifying leads that are likely to convert.
- Prescriptive analysis:
  - It answers the question “What should we do?”.
  - It covers descriptive, diagnostic, and predictive analysis to generate a decisive action that prevents a future problem or solves a current one.
  - It uses art, science, technology, and data analysis approaches to make a decision.
  - It requires a great deal of commitment from the business.
  - It uses AI and machine learning models.
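The way these types build on one another can be sketched with a toy example. The sales figures, the straight-line trend model, and the stocking rule below are all assumptions made for illustration; real predictive and prescriptive analysis would use far more data and more careful models.

```python
from statistics import mean

monthly_sales = [100.0, 110.0, 120.0, 130.0]  # hypothetical figures

# Descriptive: summarize what happened.
average = mean(monthly_sales)
growth = monthly_sales[-1] - monthly_sales[0]

# Predictive: fit a least-squares line and extrapolate one month ahead.
def fit_line(ys):
    """Return (slope, intercept) of the least-squares line through (i, ys[i])."""
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(monthly_sales)
forecast = slope * len(monthly_sales) + intercept

# Prescriptive: a toy rule that turns the forecast into a recommended action.
action = "increase stock" if forecast > monthly_sales[-1] else "hold stock"

print(average, growth, forecast, action)
```

Diagnostic analysis is not shown because it depends on drilling into additional data to explain a result, which a four-number series cannot illustrate.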
Data Analysis Process:
- Define the problem and its objectives.
- Data collection.
- Data cleaning.
- Data analysis.
- Communicate the results.

Big Data Clustering: Algorithms and Challenges ⁹¶

Challenges of Big Data¶

3Vs of big data: increased volume, velocity, and variety.
HACE (Heterogeneous, Autonomous, Complexity, Evolving):
- Heterogeneous:
  - Data comes from different sources.
- Autonomous:
  - Data is generated from independent sources.
  - There is no central control over these sources.
- Complexity:
  - Data itself is complex due to its heterogeneous and decentralized nature.
  - Contexts of data are also complex as the same data means different things in different contexts.
  - Parallel processing of the data is also complex.
  - The complexity of data increases as its 3Vs increase.
- Evolving:
  - Data is continuously evolving and changing.

Big Data Clustering¶

[Figure: Big Data Clustering Techniques]

Single machine clustering¶

Data Mining Algorithms Clustering:
- Partitioning, Hierarchical, Density, Grid, and Model-based algorithms.
- It is an unsupervised classification of patterns in data.
- Partitioning-based clustering algorithms:
  - It divides a data set into K partitions using a distance to classify points based on their similarities.
  - Drawback: it requires the number of clusters (K) to be specified in advance, and there is no deterministic way to choose it.
  - Algorithms: k-means, k-medoids, k-modes, PAM, CLARA, CLARANS, and FCM.
- Hierarchical clustering algorithms:
  - It creates a tree of clusters where each node is a cluster of the data points.
  - Drawback: once a step is taken, it cannot be undone.
  - Algorithms: BIRCH, CURE, ROCK, and CHAMELEON.
- Density clustering algorithms:
  - It groups data points based on their density which finds clusters of arbitrary shapes, as clusters are areas of high density separated by areas of low density.
  - Drawback: not suitable for large data sets.
  - Algorithms: DBSCAN, OPTICS, DBCLASD, and DENCLUE.
- Model-based clustering algorithms:
  - It assumes that the data is generated by a mixture of several probability distributions.
  - Drawback: It is sensitive to the initial values of the parameters.
  - Algorithms: EM, COBWEB, CLASSIT, and SOM.
- Grid-based clustering algorithms:
  - It divides the data space into a finite number of cells that form a grid structure, and then it deletes and merges adjacent cells until it reaches the final grid structure.
  - Advantage: reduced time complexity.
  - Algorithms: GRIDCLUS, STING, CLICK and WaveCluster.
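To make the density-based family more concrete, here is a minimal sketch of DBSCAN. The function name, the sample points, and the parameter values are assumptions for illustration, and the point itself is counted among its own neighbors when checking `min_pts`.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force search: every point within eps of point i (including i).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1           # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0),        # dense group
          (10, 10), (10, 11), (11, 10),  # second dense group
          (50, 50)]                      # isolated point
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The brute-force neighbor search compares every pair of points, so the cost grows quadratically with the number of points, which is one reason plain density-based clustering struggles with large data sets.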
Dimension Reduction:
- Feature selection and feature extraction.
- The data size can be measured in two dimensions, the number of variables and the number of examples.
- Dimension reduction is the process of reducing the number of variables in a data set.
- Its purpose is to select or extract the optimal subset of relevant features for a given criterion.
- It generates a new set that is more representative of the problem, and it is usually done before clustering.
- Feature Selection:
  - It selects the optimal subset of variables from a set of original variables.
  - Algorithms: parallel k-means.
- Feature Extraction:
  - It generates a new set of variables, each obtained by combining the original variables or applying a computation to them.
  - Algorithms: PCA, LS-SVM.
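The difference between selection and extraction can be sketched as follows. The data set, the variance threshold, and the helper name are assumptions for illustration. Selection here simply drops variables with no variance, and extraction computes the first principal component (the idea behind PCA) using power iteration, a simple stand-in for the eigendecomposition a library would perform.

```python
from statistics import mean, pvariance

rows = [  # hypothetical data: 4 examples, 3 variables
    [2.0, 4.1, 7.0],
    [3.0, 6.0, 7.0],
    [4.0, 7.9, 7.0],
    [5.0, 10.0, 7.0],
]
columns = list(zip(*rows))

# Feature selection: keep original variables whose variance exceeds a threshold.
selected = [i for i, col in enumerate(columns) if pvariance(col) > 1e-9]

def first_principal_component(data, iterations=200):
    """Return the unit direction of largest variance of mean-centered data."""
    n_vars = len(data[0])
    # Scatter matrix: the covariance matrix up to a constant factor,
    # which does not change the direction being sought.
    scatter = [[sum(r[a] * r[b] for r in data) for b in range(n_vars)]
               for a in range(n_vars)]
    v = [1.0] * n_vars
    for _ in range(iterations):
        w = [sum(scatter[a][b] * v[b] for b in range(n_vars))
             for a in range(n_vars)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Feature extraction: one new variable that combines the selected ones.
means = [mean(columns[i]) for i in selected]
centered = [[row[i] - m for i, m in zip(selected, means)] for row in rows]
direction = first_principal_component(centered)
scores = [sum(x * d for x, d in zip(row, direction)) for row in centered]

print(selected)
print(direction)
print(scores)
```

The constant third variable is removed by selection, and the two remaining correlated variables are replaced by a single score per example, so three variables are reduced to one.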

Multi-machine clustering¶

Parallel Clustering Algorithms:
- Parallel k-means and parallel fuzzy c-means.
MapReduce-based Clustering Algorithms:
- MapReduce-based k-means, EM, GPU, and DBCURE-MR.
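One iteration of k-means maps naturally onto the MapReduce pattern, which the sketch below imitates in plain Python on a single machine. No MapReduce framework is used; the function names, the points, and the starting centroids are assumptions, and a real job would distribute the map and reduce calls across nodes.

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (nearest_centroid_index, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
        yield nearest, p

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: average the points of each key into a new centroid."""
    return {key: tuple(sum(c) / len(pts) for c in zip(*pts))
            for key, pts in groups.items()}

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]  # current centroids (made-up values)

new_centroids = reduce_phase(shuffle(map_phase(points, centroids)))
print(new_centroids)
```

The map step needs only the current centroids and one point at a time, so it can run independently on each data partition; only the small per-cluster groups have to be brought together for the reduce step.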

Video Resources ¹⁰ ¹¹ ¹²¶

Clustering is an unsupervised learning technique that involves grouping data points into clusters based on their similarities.
Variables need to be prepared before clustering: categorical variables are converted to integer or binary values, and numeric variables are standardized so that they are on comparable scales.
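A minimal sketch of this preparation step is shown below: a categorical variable is encoded as binary columns (one-hot encoding) and a numeric variable is rescaled with z-scores. The variable names and values are hypothetical, and z-scoring is one common scaling choice among several.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Rescale a numeric variable to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes values are not all equal

def one_hot(values):
    """Encode a categorical variable as one binary column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

incomes = [20_000.0, 40_000.0, 60_000.0]  # hypothetical numeric variable
colors = ["red", "blue", "red"]           # hypothetical categorical variable

scaled = z_scores(incomes)
encoded = one_hot(colors)
print(scaled)
print(encoded)   # columns are ordered alphabetically: blue, red
```

Without scaling, a variable measured in tens of thousands would dominate any distance calculation over a variable measured in single digits, so the clusters would reflect the units rather than the data.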

References¶

1. nvidia.com. (2021, July 2). Cluster analysis. https://www.nvidia.com/en-us/glossary/data-science/clustering/
2. tutorialspoint.com. (2022). Data mining - cluster analysis. https://www.tutorialspoint.com/data_mining/dm_cluster_analysis.htm
3. Arena, F., & Pau, G. (2020). An overview of big data analysis. Bulletin of Electrical Engineering and Informatics, 9(4), 1646-1653. https://www.beei.org/index.php/EEI/article/view/2359/1532
4. Chen, Z. L. (2022). Research and application of clustering algorithm for text big data. Computational Intelligence and Neuroscience, 2022, 8 pages. https://doi.org/10.1155/2022/7042778
5. Grant, A. (2020, January 3). What is data analysis and why is it important? MUO. https://www.makeuseof.com/tag/what-is-data-analysis/
6. Kaminsky, A. (2015). Chapter 14 - Massively Parallel. Big CPU, big data: Solving the world’s toughest computational problems with parallel computing. https://my.uopeople.edu/pluginfile.php/1862282/mod_book/chapter/512678/Chapter%2014%20Massively%20Parallel.pdf See full book https://www.cs.rit.edu/~ark/bcbd_2/
7. Khemka, T. (2021, December 15). The importance of data analysis. Business 360. https://web.archive.org/web/20230425184207/https://b360nepal.com/the-importance-of-data-analysis/
8. Shaw, A. A. (2020). What is data analysis? Importance, types, process and methods. Marketingtutor.net. https://www.marketingtutor.net/what-is-data-analysis/
9. Zerhari, B., Ait Lahcen, A., & Mouline, S. (2015). Big data clustering: Algorithms and challenges. https://www.researchgate.net/publication/276934256_Big_Data_Clustering_Algorithms_and_Challenges
10. Computerphile. (2019, July 10). Data analysis 0: Introduction to data analysis [Video]. YouTube. https://www.youtube.com/watch?v=8GIbOJtUw8w
11. Data science dojo. (2019, March 14). Introduction to clustering [Video]. YouTube. https://www.youtube.com/watch?v=4cxVDUybHrI
12. Quantra. (2021, February 24). What is data analysis? | Why is it important? | How do you interpret and analyze data? | Quantra [Video]. YouTube. https://www.youtube.com/watch?v=Lh6frjuGuZM
