noc19-cs33 Lec 06-Hadoop MapReduce 2.0 (Part-I)



Introduction

This tutorial provides a step-by-step guide to understanding and implementing Hadoop MapReduce 2.0, as discussed in the NPTEL lecture. Hadoop MapReduce is a framework for processing large datasets in parallel across a cluster of machines; the "2.0" refers to MapReduce running on YARN, which separates resource management from job execution. This guide breaks down the core concepts and steps involved in using MapReduce effectively.

Step 1: Understand the Basics of Hadoop

  • Familiarize yourself with the Hadoop ecosystem, which includes the following components:
    • Hadoop Distributed File System (HDFS): A distributed file system that splits large files into blocks (128 MB by default in Hadoop 2.x) and replicates them across the cluster; see the example after this list.
    • YARN (Yet Another Resource Negotiator): Manages resources and scheduling in the Hadoop cluster.
    • MapReduce: A programming model for processing and generating large datasets.
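
A quick way to get a feel for HDFS is to stage a small input file into it from the command line. This is a minimal sketch; the local file name and the HDFS directory /input/path are placeholders (the same input path is reused in Step 5), not values taken from the lecture:

    # Create an input directory in HDFS and copy a local text file into it
    hdfs dfs -mkdir -p /input/path
    hdfs dfs -put localfile.txt /input/path
    # List the directory to confirm the upload
    hdfs dfs -ls /input/path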

Step 2: Grasp the MapReduce Programming Model

  • Learn the two main functions in the MapReduce model (a worked example follows this list):
    • Mapper: Processes input data and produces intermediate key-value pairs.
      • Input is split into smaller chunks.
      • Each mapper processes a chunk and emits key-value pairs.
    • Reducer: Aggregates the intermediate key-value pairs produced by the mappers.
      • After the shuffle and sort phase, receives each key together with all of the values emitted for it.
      • Combines those values to produce the final output for that key.
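
As a concrete illustration (not an example taken from the lecture), a single input line moves through a word-count job roughly as follows:

    Input split:      "to be or not to be"
    Mapper output:    (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
    Shuffle and sort: be -> [1,1]   not -> [1]   or -> [1]   to -> [1,1]
    Reducer output:   (be,2) (not,1) (or,1) (to,2)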

Step 3: Set Up Your Hadoop Environment

  • Install Hadoop on your system or set up a cloud-based Hadoop cluster.
  • Ensure the following prerequisites are met:
    • Java Development Kit (JDK) is installed.
    • Environment variables (e.g., JAVA_HOME, HADOOP_HOME) are configured properly (a minimal example follows this list).
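
As a rough guide, the variables can be set in your shell profile along the following lines. The installation paths are only example locations and will differ on your system:

    # Example paths only -- adjust to where your JDK and Hadoop are installed
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_HOME=/opt/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    # Verify the setup
    java -version
    hadoop version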

Step 4: Write Your First MapReduce Program

  1. Create a Java Project: Use an Integrated Development Environment (IDE) like Eclipse or IntelliJ.
  2. Add Hadoop Libraries: Include the required Hadoop client libraries in your project dependencies, or compile directly against the cluster's classpath as shown after this list.
  3. Implement the Mapper Class:
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Tokenize the input line on whitespace and emit (word, 1) for every token
            String[] words = value.toString().split("\\s+");
            for (String word : words) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
    
  4. Implement the Reducer Class:
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all the counts emitted for this word and write the total
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
    
  5. Set Up the Job Configuration:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyMapReduce {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(MyMapReduce.class);    // locate the JAR that contains this driver
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
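
Before the job can be run, the three classes have to be compiled and packaged into a JAR. One way to do this outside an IDE, assuming the three .java files are in the current directory and reusing the placeholder JAR name from Step 5, is:

    # Compile against the Hadoop classpath and package the classes into a JAR
    mkdir -p classes
    javac -classpath "$(hadoop classpath)" -d classes MyMapper.java MyReducer.java MyMapReduce.java
    jar cf YourJarFile.jar -C classes .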
    

Step 5: Execute Your MapReduce Job

  • Run your program using the Hadoop command line:
    hadoop jar YourJarFile.jar MyMapReduce /input/path /output/path
    
  • Check the output directory you specified; a successful run contains a _SUCCESS marker file and one part-r-* file per reducer (see the commands below).
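
A minimal sketch of how to inspect the results, assuming the placeholder output path used in the run command above:

    # List the output directory and print the reducer output
    hdfs dfs -ls /output/path
    hdfs dfs -cat /output/path/part-r-00000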

Conclusion

In this tutorial, we covered the essential concepts and steps to get started with Hadoop MapReduce 2.0. You learned about the Hadoop ecosystem, the MapReduce programming model, and how to implement and execute a simple MapReduce job. For further learning, consider exploring more complex examples and other components of the Hadoop ecosystem, such as Hive or Pig, to enhance your data processing capabilities.