Getting started with Hadoop

In this tutorial I want to share my experience of setting up and starting a local Hadoop environment. The goal here is to create a local environment in which you will be able to run MapReduce jobs and sample YARN applications.

Prerequisites

This tutorial is tailored towards a UNIX-like environment (Linux or Mac OSX). Windows users will have to adapt the corresponding setup steps to their needs.

Overview

Before we start, let's take a closer look at the basic components that are part of a Hadoop distribution. This knowledge will give you a better understanding when it comes to configuring the system. So what are the main components in a minimal Hadoop setup? Essentially there are two: HDFS for storage and YARN for resource scheduling and processing. The MapReduce runtime is also delivered with the standard distribution, but writing MapReduce jobs won't be covered in this article.

HDFS

The Hadoop Distributed File System (HDFS) is the workhorse of a Hadoop system. It is responsible for storing and managing files in a distributed and redundant way, with replication as the key concept to ensure reliability.

The image below shows an oversimplified view of the components involved.

In detail

  • NameNode: Manages the location of each block of a given file. The NameNode is the single point of entry to your HDFS system.

  • Secondary NameNode: Every change to the file system is recorded by the NameNode in an edit log. The main task of the Secondary NameNode is to periodically merge this log into the main index (fsimage), a process known as checkpointing. The idea behind this is performance: by delegating the checkpointing work to the Secondary NameNode, the actual NameNode can serve client requests faster.

  • DataNode: The DataNodes store the actual data blocks of the HDFS files on their local disks.

This is far from the complete picture. The detailed interaction workflow (read/write) between the client and the NameNode/DataNodes is out of scope of this document and will be covered in a separate article.
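
If you want to see this division of labour in action later on (once the single-node cluster from the sections below is running), the hdfs fsck tool shows you which blocks a file consists of and on which DataNodes they are stored. /SOME_FILE_IN_HDFS is just a placeholder for any file you have put into HDFS:

$ ./bin/hdfs fsck /SOME_FILE_IN_HDFS -files -blocks -locations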

Active/Passive NameNodes

In the Hadoop 2.x distribution, the NameNode can be run in an HA (High Availability) mode. Here the passive NameNode (which is not serving client requests) handles the checkpointing of the fsimage file. In a failover scenario, where the active NameNode is not reachable anymore, the passive one changes its status from passive to active and starts serving client requests. We won't cover this setup in this tutorial, because it is a more advanced topic and only relevant in a production cluster environment.

YARN

YARN stands for Yet Another Resource Negotiator and is responsible for allocating resources on the actual compute nodes (NodeManager).

In detail

  • ResourceManager: The ResourceManager is the main component in YARN. It is responsible for allocating resources on the NodeManagers and assigning them to the applications that requested them.

  • NodeManager: This is where the actual containers will live. The NodeManagers provide compute power to the applications by exposing their available resources (mainly memory and CPU) to the ResourceManager. The resource negotiation happens between the client and the ResourceManager, while the actual containers (JVM processes) run on the NodeManagers.
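
Once the YARN daemons from the later sections are up, the yarn command line client can show you this negotiation from the outside: which NodeManagers have registered with the ResourceManager, which resources they expose, and which applications are currently running:

$ ./bin/yarn node -list -all
$ ./bin/yarn application -list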

Hadoop Versions

Now it's time to choose the Hadoop version which we want to use in our setup. There are currently three version tracks available:

1.2.X - current stable version, 1.2 release
2.6.X - stable 2.x version
0.23.X - similar to 2.x.x but missing NameNode HA

Since the 1.2.x based versions are considered legacy, we will cover the 2.x version of Hadoop in this tutorial.

Download & Install

Ok, now let's get started. The main entry point for getting started with Hadoop is the Apache Hadoop Project website. Here you will find all the necessary binaries to install Hadoop.

$ mkdir hadoop
$ cd hadoop
$ wget http://mirror.netcologne.de/apache.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

Now extract the archive

$ tar xvfz hadoop-2.6.0.tar.gz

Now you should have the basic Hadoop distribution on your machine. The next step is to configure and start the HDFS and YARN daemons.
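
Provided a Java runtime can already be found on your machine (the JAVA_HOME handling is covered further down), you can do a quick sanity check of the freshly extracted distribution:

$ cd hadoop-2.6.0
$ ./bin/hadoop version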

Configure & Startup

Preparation

First you have to make sure that you can ssh into your localhost without using a password (passwordless access).

If the following command goes through without prompting for a password, you are all set.

$ ssh localhost

If that is not the case, you need to create a private/public key pair and add the public key to the authorized_keys file in your ~/.ssh folder.

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
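
Depending on your system's defaults you may also have to restrict the permissions of the authorized_keys file, otherwise sshd might ignore it:

$ chmod 0600 ~/.ssh/authorized_keys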

HDFS on a single Node

Now it is time to configure the HDFS part of your system. The configuration files can be found in the etc/hadoop/ directory (assuming you are in the unzipped hadoop distribution, in our case this would be ~/hadoop/hadoop-2.6.0/etc/hadoop).

$ cd etc/hadoop
$ ls -al 

This should give you a long list of configuration files. For our setup we will only be editing a small subset.

First we need to specify the endpoint of our NameNode in the core-site.xml.

core-site.xml

<configuration>  
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>  

Since we are only running a single-node Hadoop, we will set the global replication factor to 1.

hdfs-site.xml

<configuration>  
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>  

That was it for the configuration part so far. Before we can run any Hadoop commands, we need to set the JAVA_HOME.

$ export JAVA_HOME=/usr/lib/jvm/PATH_TO_JAVA_RUNTIME

You could also export the JAVA_HOME environment variable in hadoop-env.sh and yarn-env.sh in the configuration directory. This will prevent you from getting "JAVA_HOME is not set and could not be found." errors when you start the daemons.
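
If you are not sure where your Java runtime lives, the following commands can help you find a suitable value (the exact paths will differ on your machine; on Linux, JAVA_HOME is the directory above the bin/java binary):

# On Linux: resolve the java binary behind the symlinks
$ readlink -f $(which java)
# On Mac OSX: print the Java home directly
$ /usr/libexec/java_home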

Now it is time to format (create) your HDFS.

$ ./bin/hdfs namenode -format

This will create your HDFS under the /tmp directory. To change this location (in case you don't want to lose your HDFS after a reboot) you need to add the following properties to your hdfs-site.xml file: dfs.namenode.name.dir for the NameNode metadata and dfs.datanode.data.dir for the actual data blocks. For example:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>/PATH_TO_MY_HDFS/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/PATH_TO_MY_HDFS/datanode</value>
</property>
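
If you go down that route, make sure the target directory exists and is writable by your user, and re-run the format step so the NameNode metadata is created in the new location (note that reformatting effectively wipes the contents of your HDFS):

$ mkdir -p /PATH_TO_MY_HDFS
$ ./bin/hdfs namenode -format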

Now let's start the NameNode and DataNode daemons

$ ./sbin/start-dfs.sh
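
A quick way to verify that everything came up is the jps tool that ships with the JDK. You should see something like a NameNode, a DataNode and a SecondaryNameNode process (the process ids will differ):

$ jps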

Now let's run some commands on HDFS

$ ./bin/hadoop fs -mkdir /tmp

This will create the directory /tmp. You can list it via

$ ./bin/hadoop fs -ls /
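
To see the file system actually storing data, you can copy a local file into HDFS and read it back; here we simply reuse one of the configuration files as test input:

$ ./bin/hadoop fs -put etc/hadoop/core-site.xml /tmp
$ ./bin/hadoop fs -cat /tmp/core-site.xml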

The NameNode comes with a built-in web interface. It is reachable via

http://localhost:50070/  

To stop the NameNode and DataNode daemons run

$ ./sbin/stop-dfs.sh

YARN on a single Node

In the same way you can now configure YARN and the MapReduce component.
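
One small note here: in the 2.6.0 distribution the mapred-site.xml file usually does not exist yet, it only ships as a template. In that case create it first:

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml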

mapred-site.xml

<configuration>  
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>  

yarn-site.xml

<configuration>  
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>  

Run the YARN startup script to start the YARN daemons (ResourceManager and NodeManager).

$ ./sbin/start-yarn.sh
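
As before, jps is a quick way to check that the ResourceManager and NodeManager processes are actually running:

$ jps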

YARN also comes with its own web interface. It can be reached at

http://localhost:8088/  
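
To see YARN in action, you can submit one of the example MapReduce jobs that ship with the distribution (this assumes the HDFS daemons from the previous section are running as well; the exact jar name depends on your Hadoop version, for 2.6.0 it should look like the one below). While the job runs you can watch its progress in the web interface mentioned above:

$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 5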

Finally we want to shut down the YARN daemons via

$ ./sbin/stop-yarn.sh

Next Steps

You should now have (hopefully successfully) set up a local single-node Hadoop system. The next step would be to write a simple MapReduce job or YARN application. I'm planning to cover those topics in a future tutorial.
