WA7. MapReduce Assumptions

Statement

In this week’s reading by Lev-Libfeld and Margolin (2019), the authors discuss the fact that the MapReduce paradigm is based on several assumptions, which include: the completeness of data, independence of data set calculations, and relevancy distinguishability.

  1. Describe what each of these assumptions means.
  2. How will this impact the MapReduce paradigm, if you fail to evaluate the above-said assumptions?
  3. What will the effect of the impact on big data security be?

Answer

Introduction

The MapReduce paradigm is a process consisting of two phases. The Map phase reads data from input sources, processes it, and generates a set of key-value pairs, which are then sorted and passed to the Reduce phase. The Reduce phase takes the key-value pairs, groups them by key, and processes them to generate the final output (Kerzner & Maniyam, 2016).
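The two phases can be illustrated with a minimal, single-process sketch of a word count. The function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and chosen for illustration; a real framework such as Hadoop distributes these steps across nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key and sort the keys, as happens between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Combine all values that share a key into one output pair."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


result = run_job(["big data big cluster", "data node"])
print(result)
```

Each call to map_phase sees only one record, and each call to reduce_phase sees only one key, which is what lets a framework run many of them in parallel.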

For the MapReduce paradigm to work effectively, it relies on several assumptions about the data, and it is the responsibility of the data engineer to prepare data according to these assumptions. This text discusses three assumptions of the MapReduce paradigm and the impact that failing to evaluate them has on the results themselves and on the security of the data.

Assumptions of MapReduce

MapReduce may not be the best solution for all data sets and problems. Because of its parallel and distributed nature, it requires that pieces of data do not depend on each other during processing. This requirement is expressed through several assumptions, including completeness of the data, independence of data set calculations, and relevancy distinguishability (Lev-Libfeld & Margolin, 2019).

Completeness of the data means that the data set is finite and all of it is available before the MapReduce process starts. This is important for generating the correct key-value pairs and ensuring that results are combined correctly. It also makes MapReduce a batch processing system that is poorly suited to real-time processing.
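The effect of this assumption can be sketched with a small batch job. The data and the count_events function below are hypothetical; the point is only that a result computed over a snapshot does not reflect a record that arrives afterwards, so the job has to be rerun on the complete input.

```python
from collections import Counter


def count_events(records):
    """Batch job: count events per user over a finite, fully available input."""
    return Counter(user for user, _event in records)


# The input as it existed when the job started.
snapshot = [("alice", "login"), ("bob", "login"), ("alice", "upload")]
batch_result = count_events(snapshot)

# A record that arrives after the job has started is not part of its result.
late_record = ("bob", "upload")
full_result = count_events(snapshot + [late_record])

print(batch_result["bob"], full_result["bob"])
```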

The independence of data set calculations means that the data set can be divided into smaller parts that are processed independently of each other. This is important to ensure that each node can process the subset of data assigned to it in parallel without depending on other nodes. It allows MapReduce systems to scale horizontally to as many nodes as needed, which reduces the processing time and allows larger data sets to be accepted.
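As a rough sketch of what independence looks like in practice, the example below computes a mean from per-chunk partial results. The helper names (split, partial, merge) are assumptions made for illustration; each chunk stands in for the data assigned to one node.

```python
from functools import reduce


def split(data, parts):
    """Split data into contiguous chunks, a stand-in for input splits."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial(chunk):
    """Work done on one node: it needs nothing from any other chunk."""
    return sum(chunk), len(chunk)


def merge(a, b):
    """Combine two partial results; the order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1]


values = [1, 2, 3, 4, 100]
partials = [partial(chunk) for chunk in split(values, 2)]

total, count = reduce(merge, partials)
distributed_mean = total / count

# Merging in the opposite order gives the same totals.
reversed_total, reversed_count = reduce(merge, reversed(partials))

print(distributed_mean)
```

Carrying the pair (sum, count) rather than a per-chunk mean is a design choice: the pairs can be merged in any order and still reproduce the result of a single pass over all the data.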

Relevancy distinguishability means that MapReduce can determine which data is relevant to which job and which data is not. This is important to ensure that the Map phase assigns data from the input source to the correct node. It allows MapReduce systems to process data efficiently and reduces the need for transferring data between nodes.
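One way to picture this assumption is a map function that decides relevance for each record on its own, without looking at any other record. The record layout, the is_relevant predicate, and the "auth_failure" event type below are hypothetical and serve only as an illustration.

```python
from collections import Counter


def is_relevant(record):
    """Decide relevance from the record alone (hypothetical rule)."""
    return record.get("type") == "auth_failure"


def map_phase(record):
    """Emit a (user, 1) pair only for records relevant to this job."""
    if not is_relevant(record):
        return []
    return [(record["user"], 1)]


records = [
    {"type": "auth_failure", "user": "alice"},
    {"type": "page_view", "user": "bob"},
    {"type": "auth_failure", "user": "alice"},
    {"type": "auth_failure", "user": "carol"},
]

# Reduce step: sum the emitted values per key.
failures = Counter()
for record in records:
    for user, value in map_phase(record):
        failures[user] += value

print(dict(failures))
```

If relevance could only be decided by comparing a record with others held on different nodes, the filter could not run inside an independent map task.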

Failure to Evaluate Assumptions

Systems that rely on the MapReduce paradigm, such as Hadoop, take these assumptions for granted. Such systems may crash if the assumptions are not met, or they may continue to work with penalties such as increased processing time, memory usage, and network traffic. The worst-case scenario is that the MapReduce process finishes and generates incorrect results without any indication that the results are wrong or why.

Failing to ensure the completeness of the data may result in late-arriving data being assigned to the wrong node or the wrong key-value pair. The affected results may then be combined in the wrong way or simply omitted from the final output. This can be a major issue, and MapReduce systems usually refuse to add new data to the cluster while processing is ongoing.

Failing to ensure the independence of data set calculations may result in nodes depending on the work of other nodes to complete their own. In distributed processing, this is inefficient or sometimes impossible to achieve, as each node needs to know where the data it needs is located, or request it from the master node and then wait if the results are not ready. MapReduce systems, however, are designed so that nodes work independently, and they may not have the capability of inter-node communication during processing.
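A small sketch can show how a dependent calculation goes wrong silently when it is split across nodes. The count_increases function and the readings are hypothetical; the calculation compares each value with the one before it, so it depends on a neighbour that may live in another chunk.

```python
def count_increases(readings):
    """Count positions where a reading is higher than the one before it."""
    return sum(1 for prev, cur in zip(readings, readings[1:]) if cur > prev)


readings = [1, 3, 2, 5, 4, 6]

# Single pass over the whole data set.
true_count = count_increases(readings)

# The same calculation run independently on two chunks and then summed.
chunks = [readings[:3], readings[3:]]
distributed_count = sum(count_increases(chunk) for chunk in chunks)

print(true_count, distributed_count)
```

The chunked run misses the increase from 2 to 5 at the chunk boundary, and nothing in the output signals that a comparison was skipped.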

Failing to ensure relevancy distinguishability may result in the Map phase assigning data to the wrong node or the wrong job. This is a minor issue as far as MapReduce systems are concerned, which means that they are usually not able to detect it. Preventing it requires deep knowledge of the data and the relationships between objects.

Impact on Big Data Security

The impact of failing to evaluate the assumptions of the MapReduce paradigm on big data security comes from the fact that failing to meet these assumptions may generate incorrect results, and if any of these results are used to make security-related decisions, then the security of the corresponding systems may be compromised in ways that allow data leakage, unauthorized access, or misuse.

Failing to ensure the completeness of the data means that the results may be incomplete or incorrect, which may lead to wrong security decisions, such as allowing a program to run because not all security rules have been loaded.

Failing to ensure the independence of data set calculations means that data may need to travel through the network between nodes during processing. If the data is not encrypted in transit, this may expose it to unauthorized access and other network-related vulnerabilities.

Failing to ensure relevancy distinguishability may result in irrelevant data being considered relevant and thus exposed to unauthorized access or used in the wrong context.

Conclusion

The MapReduce paradigm relies heavily on parallelism for scaling and for the ability to handle large data sets. Such a paradigm imposes limitations that are expressed through assumptions, which the data engineer must evaluate and ensure are met before starting the MapReduce process. Failing to evaluate these assumptions may result in incorrect results, increased processing time and network traffic, and exposure of the data to unauthorized access or misuse.

References