WA2. Databases for analytics

Statement

  • Compare a traditional database with an analytical database and a NoSQL database.
  • Compare three examples; each should be drawn from one of the following areas that we discussed:
    • Databases (a traditional database, an analytical database, a NoSQL database)
    • Statistics Packages (such as SPSS, SAS, R, Minitab, and MATLAB)
    • API (including WEKA, Orange, Statistica, and Hadoop)

Part 1

Traditional database

The term traditional database usually refers to a relational database. Relational databases store data in tables and provide a way to query it with SQL. They are typically ACID compliant, meaning transactions are processed reliably and the data stays consistent. They are generally not designed to be distributed, and they are not designed for analytics.

They usually take the form of complex database management systems (DBMS) that serve many users at the same time, but typically on a single machine; the overhead of managing transactions, including locking and concurrency control, is therefore high. They are intended for transactional workloads, not for analytics (Scylla, n.d.).
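
To make the transactional style concrete, here is a minimal sketch in R using the DBI interface, with an in-memory SQLite database standing in for a full DBMS; the accounts table and the transfer are hypothetical.

    # Minimal sketch of an ACID-style transactional write, assuming the
    # DBI and RSQLite packages; SQLite stands in for a full DBMS here.
    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbExecute(con, "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    dbExecute(con, "INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")

    # Transfer funds atomically: either both updates commit or neither does.
    dbBegin(con)
    tryCatch({
      dbExecute(con, "UPDATE accounts SET balance = balance - 25 WHERE id = 1")
      dbExecute(con, "UPDATE accounts SET balance = balance + 25 WHERE id = 2")
      dbCommit(con)
    }, error = function(e) dbRollback(con))

    dbGetQuery(con, "SELECT * FROM accounts")
    dbDisconnect(con)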

Analytical database

With data sizes growing and analytics becoming more complex, traditional databases are no longer sufficient. Analytical databases prioritize reading and processing data over writing it, and they are designed to run jobs in the background when immediate results are not required. They can be row or column oriented, and they are often distributed, which typically means relaxing strict ACID guarantees.

SQL-based analytical databases are available; they run dialects of SQL tuned for analytics, which reduces the overhead of traditional SQL processing and full ACID compliance. NoSQL analytical databases also exist, usually with a SQL-compatible query language, since that gives greater flexibility in expressing complex queries; the underlying storage and processing, however, can vary widely from one database to another.
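
The contrast shows up in the workload itself: analytical queries are scan-heavy, read-only aggregations. The sketch below runs such a query from R via DBI; the sales table is made up, and SQLite again stands in for a real (columnar, distributed) analytical engine.

    # Hypothetical analytical query: scan-heavy, read-only, grouped aggregation.
    # Assumes DBI/RSQLite; an actual analytical database would run the same SQL
    # over a column-oriented, possibly distributed store.
    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "sales", data.frame(
      region = sample(c("north", "south", "east"), 1000, replace = TRUE),
      amount = runif(1000, 10, 500)
    ))

    dbGetQuery(con, "
      SELECT region, COUNT(*) AS orders, AVG(amount) AS avg_amount
      FROM sales
      GROUP BY region
      ORDER BY avg_amount DESC
    ")
    dbDisconnect(con)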

NoSQL database

Transactional NoSQL databases were introduced to address massive data growth, the increasing use of dedicated analytics databases, and the heavy overhead of SQL and ACID compliance that makes writes slow. They are designed to be distributed with little overhead compared to complex SQL database systems, at the cost of less flexibility in querying the data, which also makes them less suitable for analytics (Scylla, n.d.).

Analytical NoSQL databases are also available, as mentioned above. They may offer a SQL-compatible query language or their own, but the data is stored and queried quite differently than in traditional databases. They may also be distributed, relying on parallel processing and on the fact that real-time results are not required (Scylla, n.d.).

Part 2

I will choose DynamoDB as an example of a NoSQL database, R as an example of a statistics package, and Hadoop as an example of an API.

NoSQL: DynamoDB

DynamoDB is a NoSQL database designed for transactional data rather than analytics. It is a key-value database, returning data as semi-structured, JSON-like items rather than relational rows, and it is designed to be distributed. It relaxes the strict ACID guarantees of relational systems and is not intended for analytics workloads (AWS, n.d.).
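
As a rough illustration of the key-value access pattern, the sketch below uses the paws AWS SDK for R; the orders table, its keys, and the configured credentials are all assumptions.

    # Hypothetical key-value access to DynamoDB, assuming the paws package
    # and AWS credentials configured in the environment; table and keys are
    # made up for illustration.
    library(paws)

    ddb <- dynamodb()

    # Write one item (DynamoDB types each value: S = string, N = number).
    ddb$put_item(
      TableName = "orders",
      Item = list(
        order_id = list(S = "o-1001"),
        customer = list(S = "alice"),
        total    = list(N = "42.50")
      )
    )

    # Read it back by primary key; the result is a semi-structured item,
    # not a relational row.
    resp <- ddb$get_item(
      TableName = "orders",
      Key = list(order_id = list(S = "o-1001"))
    )
    resp$Item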

Statistics package: R

R is a programming language specialized in statistical computing and graphics. It provides a wide variety of traditional statistical computations, graphical techniques, and plotting facilities, and it comes with a large collection of packages that provide more advanced statistical techniques (R Project, n.d.).
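
For instance, a few lines of base R, using the built-in mtcars dataset, cover descriptive statistics, a linear model, and a plot:

    # Base R on the built-in mtcars dataset: summary statistics, a linear
    # model, and a plot, with no extra packages required.
    data(mtcars)

    summary(mtcars$mpg)                      # descriptive statistics
    fit <- lm(mpg ~ wt + hp, data = mtcars)  # linear regression
    summary(fit)                             # coefficients, R-squared, p-values

    plot(mtcars$wt, mtcars$mpg,
         xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
    abline(lm(mpg ~ wt, data = mtcars))      # fitted regression line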

API: Hadoop

Hadoop is an open-source framework, created in 2006, for storing and processing large datasets on clusters of computers (hardware nodes). Hadoop software runs on every connected node; it provides a distributed file system that spans those nodes, and it splits and schedules jobs to use the processing power of the whole cluster (AWS, n.d.).
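
One simple way application code plugs into Hadoop is Hadoop Streaming, which runs any executable, including an R script, as the map or reduce step, reading input splits from stdin and emitting tab-separated key/value pairs on stdout. A hypothetical word-count mapper in R could look like this:

    #!/usr/bin/env Rscript
    # Hypothetical Hadoop Streaming mapper in R: Hadoop pipes input splits
    # to stdin and collects "key<TAB>value" lines from stdout. Invoked via
    # the streaming jar with -mapper/-reducer options (paths elided here).
    input <- file("stdin", open = "r")
    while (length(line <- readLines(input, n = 1)) > 0) {
      for (word in strsplit(tolower(line), "[^a-z]+")[[1]]) {
        if (nzchar(word)) cat(word, "\t1\n", sep = "")
      }
    }
    close(input)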

The flow

  • Applications connect to DynamoDB for transactional data storage and retrieval.
  • Hadoop connects to DynamoDB to retrieve the data and process it through an ETL pipeline: a set of jobs executed in a specific order that extract the data from DynamoDB, transform it into a format suitable for analytics, and load it into a data warehouse.
  • The output of the ETL pipeline (Hadoop) is stored in a format that R can read and process, such as CSV files or data frames.
  • R then reads Hadoop's output and processes it using R scripts written to perform the required analytics (a minimal sketch follows this list).
  • The output of R is stored in a format suitable for human analysts to read and understand (tables, graphs, etc.).
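
Putting the last two steps together, here is a sketch of the R stage; the etl_output.csv file and its region/revenue columns are hypothetical stand-ins for whatever the ETL pipeline actually produces.

    # Sketch of the final analytics step, assuming the ETL pipeline wrote a
    # CSV file "etl_output.csv" with hypothetical columns region and revenue.
    sales <- read.csv("etl_output.csv")

    # Summary table for analysts: total and average revenue per region.
    report <- aggregate(revenue ~ region, data = sales,
                        FUN = function(x) c(total = sum(x), mean = mean(x)))
    print(report)

    # Chart saved for human consumption.
    png("revenue_by_region.png")
    barplot(tapply(sales$revenue, sales$region, sum),
            main = "Total revenue by region", ylab = "Revenue")
    dev.off()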

References