Since the Getting Started Guide for Spark 1.3.0 on HDP has become a little outdated, I wanted to write down an updated version showing how to get the latest Apache Spark release, 1.4.1, running on HDP. For this tutorial I will be using Hortonworks distribution version 188.8.131.52. However, the setup should also work for the newly released HDP 2.3.
Download and Configure
Prerequisite: Linux environment with preinstalled Hadoop client tooling (ideally a utility machine within your cluster).
To get started, we first need to download the latest Spark distribution from the official download page. Since we want to run our Spark environment on top of YARN, we have to select the "Pre-built for Hadoop 2.6 and later" package.
Now let's untar the downloaded archive:
tar xvfz spark-1.4.1-bin-hadoop2.6.tgz
Then change into the extracted spark-1.4.1-bin-hadoop2.6 directory, since this will be our base directory ($SPARK_HOME) for the rest of this tutorial.
Before we start Spark for the first time, we need to make some changes to the default configuration file. Spark ships with an example file in its $SPARK_HOME/conf directory, which we will be extending.
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
Now let's fire up an editor and add the following lines to $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=184.108.40.206-1
spark.yarn.am.extraJavaOptions -Dhdp.version=220.127.116.11-1
This tells YARN which HDP jar files to use when starting the Spark components. In detail, hdp.version is used as a substitution in the classpath variable. To figure out which version of HDP you are using, take a look into the /usr/hdp directory:
ls -al /usr/hdp
drwxr-xr-x 4 root root 4096 Jun 30 14:23 ./
drwxr-xr-x 12 root root 4096 Jun 30 14:23 ../
drwxr-xr-x 24 root root 4096 Jun 30 14:28 18.104.22.168-1/
drwxr-xr-x 2 root root 4096 Jun 30 14:29 current/
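The lookup and the two configuration lines can also be scripted. The following is a minimal sketch, not part of the original setup: detect_hdp_version and add_hdp_version are hypothetical helper names, and the sketch assumes that every directory under /usr/hdp other than current is an installed HDP version.

```shell
# Hypothetical helpers; the function names are illustrative only.

# Print every versioned directory under the HDP root (default: /usr/hdp),
# skipping the "current" directory.
detect_hdp_version() {
  local hdp_root="${1:-/usr/hdp}" dir name
  for dir in "$hdp_root"/*/; do
    [ -d "$dir" ] || continue        # glob matched nothing
    dir="${dir%/}"                   # strip the trailing slash
    name="${dir##*/}"                # keep only the last path component
    [ "$name" = "current" ] && continue
    printf '%s\n' "$name"
  done
}

# Append the two hdp.version options to a spark-defaults.conf file,
# skipping any key that is already set (so the call is safe to repeat).
add_hdp_version() {
  local conf_file="$1" hdp_version="$2" key
  for key in spark.driver.extraJavaOptions spark.yarn.am.extraJavaOptions; do
    if ! grep -q "^${key}[[:space:]]" "$conf_file" 2>/dev/null; then
      printf '%s -Dhdp.version=%s\n' "$key" "$hdp_version" >> "$conf_file"
    fi
  done
}

# Example (run on the cluster node):
#   add_hdp_version "$SPARK_HOME/conf/spark-defaults.conf" "$(detect_hdp_version)"
```

If more than one version is installed, detect_hdp_version prints several lines; in that case pass the intended version to add_hdp_version explicitly.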
Before we can run our first Spark job on YARN, we need to set some environment variables to tell Spark where the Hadoop/YARN configuration files reside. Since we don't want to export HADOOP_CONF_DIR and YARN_CONF_DIR every time, we are going to place those two variables into the spark-env.sh file:
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
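Add the exports for these variables to $SPARK_HOME/conf/spark-env.sh, pointing them at the directory that holds your Hadoop/YARN client configuration.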
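As a sketch, the entries in spark-env.sh could look like the following. The path /etc/hadoop/conf is an assumption (it is the usual client configuration location on HDP nodes); adjust it if your configuration lives elsewhere.

```shell
# Sketch of a spark-env.sh fragment.
# /etc/hadoop/conf is an assumed default; an already exported value wins.
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
# Reuse the same directory for YARN unless YARN_CONF_DIR is set separately.
export YARN_CONF_DIR="${YARN_CONF_DIR:-$HADOOP_CONF_DIR}"
```

Using the default-value expansion means a value already exported in your shell is not overwritten when Spark sources the file.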
That's it! Your environment should now be all set up and ready to run Spark jobs and the Spark shell on Hadoop YARN.
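Before launching anything, the two configuration steps can be double-checked with a small script. This is only a sketch: check_spark_setup is a hypothetical helper that inspects the files under $SPARK_HOME/conf and does not contact the cluster.

```shell
# Hypothetical preflight check; it only reads files and never starts Spark.
# Returns 0 when both settings are present, 1 otherwise.
check_spark_setup() {
  local spark_home="$1" status=0
  # An uncommented line in spark-defaults.conf must mention hdp.version.
  if ! grep -q '^[^#]*hdp\.version' "$spark_home/conf/spark-defaults.conf" 2>/dev/null; then
    echo "hdp.version is not set in conf/spark-defaults.conf" >&2
    status=1
  fi
  # spark-env.sh must assign HADOOP_CONF_DIR or YARN_CONF_DIR (comments do not count).
  if ! grep -Eq '^[[:space:]]*(export[[:space:]]+)?(HADOOP|YARN)_CONF_DIR=' \
       "$spark_home/conf/spark-env.sh" 2>/dev/null; then
    echo "no HADOOP_CONF_DIR/YARN_CONF_DIR assignment in conf/spark-env.sh" >&2
    status=1
  fi
  return "$status"
}

# Example: check_spark_setup "$SPARK_HOME" && echo "configuration looks complete"
```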
Running Spark on YARN
Now we can try to run the spark-shell on top of YARN. In the $SPARK_HOME directory, run:
./bin/spark-shell --master yarn-client
--master yarn-client tells Spark to run in yarn-client mode: the driver stays in your local shell process, while the application master and the executors run as YARN containers on the corresponding NodeManagers.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/
After the bootstrapping is done, you will see the Scala console and can start interacting with your Spark shell.
Now have fun with Spark on YARN!