What does the MapReduce framework consist of?
The MapReduce framework consists of the following components:
- Map function: applied to each record of an input split (the framework, not the function itself, divides the input data into splits); it emits a series of intermediate key-value pairs.
- Reduce function: performs a reduction operation on the values grouped under each key in the Map output to produce the final result.
- Distributed file system (HDFS in Hadoop): used for storing input data and output results.
- JobTracker: in Hadoop 1.x, the master daemon that oversees the execution of the entire job. It assigns tasks to available nodes and monitors the progress of task execution.
- TaskTracker: in Hadoop 1.x, the per-node daemon that carries out specific tasks. It receives task assignments from the JobTracker, runs them, and reports the status of each task back to the JobTracker.
- Master node: the machine that hosts the JobTracker and therefore manages the entire execution of the MapReduce job, including task scheduling and monitoring.
- Worker node: a machine that hosts a TaskTracker and executes the individual Mapper and Reducer tasks.
- Shuffle process: transfers the Mapper output to the Reducers. The output is sorted by key, and all records with the same key are delivered to the same Reducer.
- Combiner function: an optional intermediate reduction function used to reduce the amount of data transferred by performing a partial reduction on the output of the Map phase.
- Partitioner function: decides which Reducer receives each Mapper output record; by default the choice is based on the hash value of the key, taken modulo the number of Reducers.
All these components together form the MapReduce framework, enabling the parallel processing of large datasets.
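To make the data flow between these components concrete, here is a minimal single-process sketch of a word count in Python. It is only an illustration, not Hadoop's API: the names `map_fn`, `combine`, `partition`, `reduce_fn`, and `run_job` are hypothetical, the input splits are plain in-memory lists rather than HDFS blocks, and there is no JobTracker or TaskTracker because everything runs sequentially in one process.

```python
from collections import defaultdict
from zlib import crc32


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def combine(pairs):
    """Combiner: partially reduce one map task's output before the shuffle."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def partition(key, num_reducers):
    """Partitioner: deterministic hash of the key modulo the reducer count."""
    return crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Reduce: sum all partial counts collected for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    # One bucket per (simulated) reduce task: key -> list of values.
    buckets = [defaultdict(list) for _ in range(num_reducers)]

    # Map phase: each split stands in for the input of one map task.
    for split in splits:
        mapped = [pair for line in split for pair in map_fn(line)]
        # Shuffle: route each combined pair to the bucket chosen by the
        # partitioner, grouping values that share a key.
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each bucket is processed in sorted key order.
    result = {}
    for bucket in buckets:
        for key in sorted(bucket):
            word, total = reduce_fn(key, bucket[key])
            result[word] = total
    return result


# Two hypothetical input splits, each a list of text lines.
splits = [["the quick brown fox", "the lazy dog"], ["the fox jumps"]]
counts = run_job(splits)
```

In this sketch the combiner turns the two `("the", 1)` pairs of the first split into a single `("the", 2)` pair before the shuffle, which is the data-transfer saving the Combiner bullet describes. Because the partitioner depends only on the key, every partial count for a given word ends up in the same bucket, so one `reduce_fn` call sees all of them.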