How to use MapReduce in Hadoop?
To use Hadoop’s MapReduce, you need to follow these steps:
- Define the Map function: the Map function transforms each input record into intermediate key-value pairs. You write a Mapper that specifies how the input data is turned into those pairs.
- Define the Reduce function: the Reduce function processes the key-value pairs emitted by the Map function, aggregating all values that share the same key. You write a Reducer that specifies how these grouped values are combined into the final output.
- Configure the MapReduce job: use Hadoop’s Job and Configuration APIs (or configuration files) to set the job’s parameters, such as the input path, output path, Mapper class, and Reducer class.
- Run the MapReduce job: submit and run the job with Hadoop’s command-line tools or programmatically; the complete example below walks through all four steps.
Here is an example program using Hadoop MapReduce:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    // Mapper: splits each input line into tokens and emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all counts received for a word and emits (word, total).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures the job (mapper, combiner, reducer, output types, paths) and submits it.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // args[0]: input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // args[1]: output path (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This sample code is a simple word-count program. The Mapper tokenizes each line of the input file and emits a (word, 1) key-value pair for every token; the Reducer then sums the counts received for each word. Finally, the job writes each word together with its total frequency to the output directory.
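For example, given a hypothetical input file containing the single line "hello world hello", the output (written by the default TextOutputFormat, which separates key and value with a tab) would look like this:
hello	2
world	1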
Compile the code and package it into a JAR file (for example with javac and jar, or with a build tool such as Maven), then submit and run the MapReduce job.
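A minimal packaging sketch, assuming the hadoop command is on your PATH (the `hadoop classpath` call supplies the Hadoop libraries; file names are illustrative):
javac -classpath "$(hadoop classpath)" WordCount.java
jar cf WordCount.jar WordCount*.class
With the JAR built, submit and run the job with the following command: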
hadoop jar WordCount.jar WordCount input output
Here WordCount.jar is the JAR file you packaged, WordCount is the main class, input is the input path, and output is the output path. The output directory must not already exist, or the job will fail.
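If the input data starts out on the local file system, you would typically copy it into HDFS first and read the result back afterwards; a rough sketch with illustrative file names:
hdfs dfs -mkdir -p input
hdfs dfs -put words.txt input
# ... run the job as shown above, then inspect the result:
hdfs dfs -cat output/part-r-00000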
Before running MapReduce jobs, make sure Hadoop is installed and configured, and that the cluster services (HDFS and YARN) are actually running.
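On a single-node setup, one quick sanity check (assuming the JDK's jps tool is available) is to list the running Java processes and confirm the Hadoop daemons are up:
jps
# output should include processes such as:
# NameNode
# DataNode
# ResourceManager
# NodeManager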