noc19-cs33 Lecture 3: Hadoop Stack for Big Data

Published on Oct 26, 2024

Introduction

This tutorial provides a comprehensive overview of the Hadoop stack for big data, based on Lecture 3 of the noc19-cs33 series from IIT Kanpur. It covers the essential components of the Hadoop ecosystem, their functions, and how they work together to manage and analyze large datasets. Understanding these concepts is crucial for anyone looking to leverage big data technologies in their projects.

Step 1: Understand Hadoop Framework

  • Hadoop Overview: Familiarize yourself with Hadoop as an open-source framework designed for distributed storage and processing of big data.
  • Components of Hadoop:
    • Hadoop Distributed File System (HDFS): A scalable, fault-tolerant file system that splits data into blocks and replicates them across multiple machines (see the short example after this list).
    • MapReduce: A programming model for processing large datasets with a distributed algorithm on a cluster.
    • YARN (Yet Another Resource Negotiator): A resource management layer that manages computing resources in clusters and allocates them to various applications.
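
  • HDFS in Practice: Once a cluster is running (the steps below set one up), you can see the distributed storage at work by copying a file into HDFS and inspecting how it was split into replicated blocks; the paths here are illustrative:
    hdfs dfs -put ./data.txt /user/demo/data.txt
    hdfs fsck /user/demo/data.txt -files -blocks -locations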

Step 2: Install Hadoop

  • Prerequisites:
    • Ensure Java is installed on your system (Hadoop must also be told where to find it; see the JAVA_HOME note after these steps).
    • Download the Hadoop binary from the official Apache Hadoop website.
  • Installation Steps:
    1. Extract the downloaded Hadoop tar file.
    2. Set up environment variables in your .bashrc or .bash_profile (the start-up scripts used later live in sbin, so add it to PATH alongside bin):
      export HADOOP_HOME=/path/to/hadoop
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
      
    3. Verify the installation by running:
      hadoop version
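
  • Set JAVA_HOME: Hadoop also reads JAVA_HOME from etc/hadoop/hadoop-env.sh; a minimal sketch (the path below is illustrative and depends on where Java is installed on your system):
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64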
      

Step 3: Configure Hadoop

  • Edit Configuration Files: Modify the following files, located in the $HADOOP_HOME/etc/hadoop directory:

    • core-site.xml: Set the default filesystem.
    • hdfs-site.xml: Configure HDFS replication factor and storage.
    • mapred-site.xml: Specify which framework runs MapReduce jobs.
    • yarn-site.xml: Set up YARN resource management details (example snippets for these two files follow below).
  • Example Configuration: Note that the two properties shown below belong in two different files:

    core-site.xml:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

    hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
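
  • Remaining Files: For a single-node (pseudo-distributed) setup, mapred-site.xml and yarn-site.xml typically need only one property each; the values below follow the standard Apache Hadoop single-node guide and may need adjusting for a real cluster:

    mapred-site.xml:

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>

    yarn-site.xml:

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>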
    

Step 4: Start Hadoop Services

  • Start HDFS:
    • Format the NameNode (do this only once, when first setting up the cluster):
      hdfs namenode -format
      
    • Start the HDFS services:
      start-dfs.sh
      
  • Start YARN:
    • Launch the YARN services:
      start-yarn.sh
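
  • Verify the Daemons: A quick sanity check is jps, which lists running Java processes; on a typical single-node setup you would expect to see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (the exact set depends on your configuration). The HDFS commands are a simple smoke test; the directory name is illustrative:
    jps
    hdfs dfs -mkdir -p /user/demo
    hdfs dfs -ls /user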
      

Step 5: Run a Sample MapReduce Job

  • Write a Sample Program: Create a simple MapReduce application in Java or Python.
  • Example Word Count Program (skeleton; a complete sketch follows this list):
    public class WordCount {
        public static void main(String[] args) throws Exception {
            // Configure a Job, set the Mapper/Reducer classes and the
            // input/output paths, then submit it (full sketch below)
        }
    }
    
  • Compile and Package: Use javac and jar to compile and package your program (see the usage example after this list).
  • Execute the Job:
    hadoop jar your-program.jar input-directory output-directory
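
  • Complete Example: The lecture does not prescribe exact code, but a minimal, self-contained word count using the org.apache.hadoop.mapreduce API typically looks like the sketch below; the class name and the use of a combiner are illustrative choices:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in each input line
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the per-word counts emitted by the mappers
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

  • One way to compile, package, and run it (hadoop classpath prints the jars needed at compile time; file and directory names are illustrative):
    javac -classpath "$(hadoop classpath)" WordCount.java
    jar cf wordcount.jar WordCount*.class
    hadoop jar wordcount.jar WordCount input-directory output-directory
    hdfs dfs -cat output-directory/part-r-00000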
    

Conclusion

In this tutorial, you learned the fundamentals of the Hadoop stack for big data, including its components, installation, configuration, and execution of a sample MapReduce job. Next steps include exploring more advanced features of Hadoop, diving deeper into other components like Hive or Pig, and practicing building and running your own big data applications. Getting hands-on experience will significantly enhance your understanding and skills in big data technologies.