noc19-cs33 Lecture 3-Hadoop Stack For Big Data
Introduction
This tutorial provides a comprehensive overview of the Hadoop stack for big data, based on Lecture 3 of the noc19-cs33 series from IIT Kanpur. It covers the essential components of the Hadoop ecosystem, their functionalities, and how they work together to store and analyze large datasets. Understanding these concepts is crucial for anyone looking to leverage big data technologies in their projects.
Step 1: Understand Hadoop Framework
- Hadoop Overview: Familiarize yourself with Hadoop as an open-source framework designed for distributed storage and processing of big data.
- Components of Hadoop:
- Hadoop Distributed File System (HDFS): A scalable and fault-tolerant file system that stores data across multiple machines.
- MapReduce: A programming model for processing large datasets with a distributed algorithm on a cluster.
- YARN (Yet Another Resource Negotiator): A resource management layer that manages computing resources in clusters and allocates them to various applications.
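The map → shuffle/sort → reduce flow can be loosely illustrated with an ordinary Unix pipeline (a local analogy only, not Hadoop itself): `tr` plays the mapper by emitting one word per line, `sort` stands in for the shuffle, and `uniq -c` acts as the reducer.

```shell
# Word count as a pipeline: "map" each word onto its own line,
# "shuffle" by sorting, then "reduce" by counting duplicates.
printf 'big data big hadoop\n' | tr ' ' '\n' | sort | uniq -c
# counts each word: 2 big, 1 data, 1 hadoop (sorted by word)
```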
Step 2: Install Hadoop
- Prerequisites:
- Ensure Java is installed on your system.
- Download the Hadoop binary from the official Apache Hadoop website.
- Installation Steps:
- Extract the downloaded Hadoop tar file.
- Set up environment variables in your `.bashrc` or `.bash_profile`:

```bash
export HADOOP_HOME=/path/to/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
```

- Verify the installation by running:

```bash
hadoop version
```
Step 3: Configure Hadoop
- Edit Configuration Files: Modify the following configuration files located in the `etc/hadoop` directory:
- `core-site.xml`: Set the default filesystem.
- `hdfs-site.xml`: Configure the HDFS replication factor and storage.
- `mapred-site.xml`: Specify the MapReduce framework.
- `yarn-site.xml`: Set up YARN resource management details.
- Example Configuration: Note that the two properties below live in different files.

In `core-site.xml`:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

In `hdfs-site.xml`:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
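The remaining two files can be filled in similarly. As a minimal single-node sketch (values follow the standard pseudo-distributed setup; adjust for your cluster), `mapred-site.xml` tells MapReduce to run on YARN, and `yarn-site.xml` enables the shuffle service:

```xml
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```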
Step 4: Start Hadoop Services
- Start HDFS:
- Format the namenode (do this only once):

```bash
hdfs namenode -format
```

- Start the HDFS services:

```bash
start-dfs.sh
```

- Start YARN:
- Launch the YARN services:

```bash
start-yarn.sh
```
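Once the scripts have run, you can confirm that the daemons are up with the JDK's `jps` tool. The exact process list varies by version and configuration, but on a single-node setup it typically resembles the comment below:

```shell
jps
# Typical single-node daemons (process IDs will differ):
#   NameNode
#   DataNode
#   SecondaryNameNode
#   ResourceManager
#   NodeManager
```

If any daemon is missing, check the log files under `$HADOOP_HOME/logs` for the cause.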
Step 5: Run a Sample MapReduce Job
- Write a Sample Program: Create a simple MapReduce application in Java or Python.
- Example Word Count Program:

```java
public class WordCount {
    public static void main(String[] args) throws Exception {
        // Your MapReduce code here
    }
}
```
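The skeleton above can be fleshed out along the lines of the standard Hadoop MapReduce WordCount pattern. The sketch below requires the Hadoop client libraries on the classpath; the mapper emits `(word, 1)` pairs and the reducer sums the counts per word:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for each input line, emit (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum all counts received for the same word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```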
- Compile and Package: Use the `javac` and `jar` commands to compile and package your program.
- Execute the Job:

```bash
hadoop jar your-program.jar input-directory output-directory
```
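Concretely, the compile-package-run cycle can be sketched as follows (file and directory names here are illustrative; `hadoop classpath` prints the classpath needed to compile against the Hadoop libraries):

```shell
# Compile against the Hadoop libraries and package into a jar.
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WordCount.java
jar cf wordcount.jar -C classes .

# Stage input in HDFS, run the job, and inspect the result.
hdfs dfs -mkdir -p input
hdfs dfs -put sample.txt input
hadoop jar wordcount.jar WordCount input output
hdfs dfs -cat output/part-r-00000
```

Note that the output directory must not already exist in HDFS, or the job will fail at submission.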
Conclusion
In this tutorial, you learned the fundamentals of the Hadoop stack for big data, including its components, installation, configuration, and execution of a sample MapReduce job. Next steps include exploring more advanced features of Hadoop, diving deeper into other components like Hive or Pig, and practicing building and running your own big data applications. Getting hands-on experience will significantly enhance your understanding and skills in big data technologies.