Apache HADOOP: MapReduce (Processing Big Data)

Mannan Ul Haq

MapReduce

MapReduce is a programming model and processing framework for distributed computing on large datasets. It is a core component of the Apache Hadoop project and is widely used for parallel processing of data across a cluster of computers.

The MapReduce program works in two phases, namely:

  • Map
  • Reduce

Map tasks handle the splitting and mapping of the input data. Reduce tasks shuffle the intermediate data and reduce it (aggregate, summarize, filter, or transform).


Here's a breakdown of what MapReduce entails:

  1. Map Phase: In the Map phase, the input data is divided into smaller chunks, and each chunk is processed independently by multiple Mapper tasks in parallel. The Mapper tasks produce intermediate key-value pairs as output.
  2. Shuffle and Sort: After the Map phase, the intermediate key-value pairs are shuffled and sorted based on their keys. This process ensures that all values associated with the same key are grouped together and passed to the same Reducer task.
  3. Reduce Phase: In the Reduce phase, the shuffled and sorted intermediate data is processed by Reducer tasks. Each Reducer receives a subset of the intermediate data with the same key and performs aggregation or computation on these values to produce the final output.
  4. Output: The output of the Reducer tasks is typically written to a distributed file system, such as Hadoop Distributed File System (HDFS).
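The four phases above can be illustrated with a minimal pure-Python sketch of a word count, with no Hadoop involved; the toy `splits` list stands in for the input chunks a real job would read from HDFS:

```python
from collections import defaultdict

# Toy input: each element stands for one input split.
splits = ["hello world", "hello mapreduce world"]

# Map phase: each split is processed independently, emitting (key, value) pairs.
intermediate = []
for split in splits:
    for word in split.split():
        intermediate.append((word, 1))

# Shuffle and sort: group all values that share a key.
grouped = defaultdict(list)
for key, value in sorted(intermediate):
    grouped[key].append(value)

# Reduce phase: aggregate the grouped values per key.
output = {key: sum(values) for key, values in grouped.items()}
print(output)  # {'hello': 2, 'mapreduce': 1, 'world': 2}
```

In a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network, but the key-grouping logic is exactly this.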

JobTracker and TaskTracker:

JobTracker and TaskTracker are the key components responsible for managing and coordinating the execution of MapReduce jobs across a cluster of machines. (They belong to the classic MapReduce v1 framework; in Hadoop 2.x and later, YARN's ResourceManager and NodeManagers take over these roles.) Here's an overview of each:


JobTracker:

The JobTracker is the central component of the MapReduce framework. It is responsible for accepting job submissions, scheduling tasks, monitoring their execution, and managing the overall workflow of MapReduce jobs. The JobTracker schedules MapReduce jobs onto available TaskTrackers (worker nodes) in the cluster.


TaskTracker:

The TaskTracker is a worker node component that runs on each machine in the Hadoop cluster. It is responsible for executing MapReduce tasks assigned by the JobTracker and reporting their status back to the JobTracker. The TaskTracker runs Map, Reduce, or other specialized tasks as directed by the JobTracker. It launches and monitors individual task attempts, handling task execution failures and retries.

The TaskTracker periodically sends heartbeats to the JobTracker to indicate that it is alive and functioning. These heartbeats also contain status updates on the tasks being executed, allowing the JobTracker to monitor progress and detect failures.
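The heartbeat mechanism can be sketched as a toy model in Python. This is a simplified illustration, not Hadoop's actual protocol: the class and method names (`JobTracker`, `receive_heartbeat`, `dead_trackers`) and the timeout value are invented for this sketch, and a real heartbeat also carries task progress and free task slots:

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before a tracker is presumed dead (illustrative)

class JobTracker:
    """Toy model: records the last heartbeat time of each TaskTracker."""
    def __init__(self):
        self.last_heartbeat = {}

    def receive_heartbeat(self, tracker_id, status):
        # A real heartbeat also reports task status, letting the
        # JobTracker monitor progress; here we only record liveness.
        self.last_heartbeat[tracker_id] = time.monotonic()

    def dead_trackers(self):
        # Any tracker that has missed the timeout window is considered failed,
        # and its tasks would be rescheduled elsewhere.
        now = time.monotonic()
        return [t for t, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

jt = JobTracker()
jt.receive_heartbeat("tracker-1", status={"running_tasks": 2})
jt.receive_heartbeat("tracker-2", status={"running_tasks": 0})
# Simulate tracker-2 going silent by backdating its last heartbeat.
jt.last_heartbeat["tracker-2"] -= 10
print(jt.dead_trackers())  # ['tracker-2']
```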

# MapReduce Code with MRJob

from mrjob.job import MRJob
from mrjob.step import MRStep

class WordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # Pre-aggregate counts on the mapper node to cut shuffle traffic.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Sum the counts for each word to produce the final total.
        yield word, sum(counts)

    def steps(self):
        return [
            MRStep(mapper=self.mapper, combiner=self.combiner, reducer=self.reducer)
        ]

if __name__ == "__main__":
    WordCount.run()


Running MRJob Files on Google Colab

MRJob is a powerful library that allows you to write and run MapReduce jobs in Python, and Google Colab provides a convenient platform for running these jobs without setting up a Hadoop cluster.


Step 1: Write the Code for Your MRJob Task on Colab Notebook and Make Input Text File

Before you can run a MapReduce job using MRJob on Google Colab, you need to write the code for your MRJob task. MRJob scripts typically consist of mapper, combiner, and reducer functions plus a main entry point. Also create a text file to use as input.

Step 2: Save the Code in .py Format on Your Computer

Step 3: Open a New Google Colab Notebook and Install the MRJob Library

Step 4: Upload Your MRJob Code and Input Text File to Google Colab

Step 5: Run Your Code and Get the Output
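Putting the steps together, the Colab workflow can look like the following notebook cells. The filenames `word_count.py` and `input.txt` are placeholders for whatever you named your script and input file:

```shell
# Step 3: install the MRJob library in the Colab environment
!pip install mrjob

# Step 4: upload word_count.py and input.txt via the Colab file browser,
# or with the upload widget:
#   from google.colab import files; files.upload()

# Step 5: run the job locally and print each (word, count) pair
!python word_count.py input.txt
```

By default MRJob uses its inline/local runner, so the job executes entirely inside the Colab machine; no Hadoop cluster is contacted.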

