Apache Spark 1.4.1 on YARN on HDP

Since the Getting Started Guide for Spark 1.3.0 on HDP has become a little outdated, I wanted to write an updated guide on how to get the latest Apache Spark version 1.4.1 running on HDP. For this tutorial I will be using the Hortonworks distribution version 2.2.6.3. However, the setup should also work for the newly released HDP 2.3.

Download and Configure

Prerequisite: Linux environment with preinstalled Hadoop client tooling (ideally a utility machine within your cluster).

To get started, we first need to download the latest Spark distribution from the official download page. Since we want to run our Spark environment on top of YARN, we have to select the "Pre-built for Hadoop 2.6 and later" package.
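If you prefer to do this from the command line, you can also fetch the tarball directly, for example from the Apache archive (mirror URLs may vary):

wget https://archive.apache.org/dist/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz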

Now let's untar the downloaded archive:

tar xvfz spark-1.4.1-bin-hadoop2.6.tgz  

Change into spark-1.4.1-bin-hadoop2.6, since this will be our base directory ($SPARK_HOME) for the rest of this tutorial.

cd spark-1.4.1-bin-hadoop2.6  
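The commands below reference $SPARK_HOME. If you haven't set it yet, you can simply point it at the directory we just entered:

export SPARK_HOME=$(pwd)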

Before we start Spark for the first time, we need to make some changes to the default configuration file. Spark ships with an example file in its $SPARK_HOME/conf directory which we will be extending.

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf  

Now let's fire up an editor and add the following lines to the spark-defaults.conf file:

spark.driver.extraJavaOptions -Dhdp.version=2.2.6.3-1  
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.6.3-1  

This will tell YARN which HDP jar files to use when starting the Spark components. In detail, YARN substitutes the hdp.version property into the ${hdp.version} placeholders that HDP uses in its classpath configuration; without it, container launches fail with a "bad substitution" error. To figure out which version of HDP you are using, take a look into the /usr/hdp directory.

ls -al /usr/hdp

drwxr-xr-x  4 root root 4096 Jun 30 14:23 ./  
drwxr-xr-x 12 root root 4096 Jun 30 14:23 ../  
drwxr-xr-x 24 root root 4096 Jun 30 14:28 2.2.6.3-1/  
drwxr-xr-x  2 root root 4096 Jun 30 14:29 current/  
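To see where this substitution actually happens, you can grep your Hadoop configuration for the placeholder (assuming it lives in /etc/hadoop/conf, as on a standard HDP install):

grep -rl 'hdp.version' /etc/hadoop/conf/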

Before we can run our first Spark job on YARN, we need to set an environment variable to tell Spark where the Hadoop/YARN configuration files reside. Since we don't want to export HADOOP_CONF_DIR (or YARN_CONF_DIR) every time, we are going to place the variable in spark-env.sh.

cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh  

Add the following line to spark-env.sh:

HADOOP_CONF_DIR=/etc/hadoop/conf  

That's it! Your environment should now be all set up and ready to run Spark jobs or the Spark shell on Hadoop YARN.
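As a quick smoke test, you can submit the SparkPi example that ships with the distribution (the exact name of the examples jar under lib/ may differ slightly for your download):

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client lib/spark-examples*.jar 10

If everything is wired up correctly, the job should finish and print a line like "Pi is roughly 3.14...".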

Running Spark on YARN

Now we can try to run the spark-shell on top of YARN. From the $SPARK_HOME directory, run:

./bin/spark-shell --master yarn-client

Here --master yarn-client tells Spark to run the driver locally on the client machine, while the application master and the executors are started as YARN containers on the cluster's NodeManagers.

Welcome to  
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

After the bootstrapping is done, you will see the Scala prompt and can start interacting with your Spark shell.

scala>  
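To verify that work is actually being distributed to the YARN executors, you can run a tiny job right from the prompt, for example summing the numbers from 1 to 1000:

scala> sc.parallelize(1 to 1000).sum()
res0: Double = 500500.0

While the shell is running, the application should also show up in the YARN ResourceManager UI.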

Now have fun with Spark on YARN!
