Speeding up Spark on YARN Startup Time

Getting Apache Spark running on YARN is pretty easy and straightforward. I covered in a previous post how to get started with Spark 1.3.0 on HDP 2.2. However, you might have noticed that bootstrapping a Spark environment on YARN can take a couple of seconds. This is because the Spark assembly jar has to be copied to HDFS first, before the containers can be created on the NodeManager machines.

The uploading step is reflected in the following log snippet when you run the ./bin/spark-shell --master yarn-client command:

INFO yarn.Client: Uploading resource file:/$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://NAMENODE:8020/user/USER_ID/.sparkStaging/application_1427707672069_0012/spark-assembly-1.3.0-hadoop2.4.0.jar  

To avoid uploading the spark-assembly jar every time you run a Spark job or shell, you first need to copy the corresponding jar file to a location of your choice inside HDFS.

hadoop fs -put $SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar /apps/  

Here, we uploaded the jar file into the /apps/ folder.
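To double-check that the assembly landed where you expect, you can simply list the target folder:

hadoop fs -ls /apps/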

Now you have to export the SPARK_JAR environment variable to tell Spark where the jar file is located.

export SPARK_JAR=hdfs://NAMENODE:PORT/apps/spark-assembly-1.3.0-hadoop2.4.0.jar  
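Note that this export only lives as long as your current shell session. One way to make it permanent (assuming a standard Spark layout) is to put the export into conf/spark-env.sh, which the Spark launch scripts source on startup:

# conf/spark-env.sh
export SPARK_JAR=hdfs://NAMENODE:PORT/apps/spark-assembly-1.3.0-hadoop2.4.0.jar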

UPDATE April 8th 2015

With Spark 1.3.0 the SPARK_JAR environment variable has been deprecated. Instead of SPARK_JAR you should now set the spark.yarn.jar property in your spark-defaults.conf configuration file:

spark.yarn.jar hdfs://NAMENODE:PORT/apps/spark-assembly-1.3.0-hadoop2.4.0.jar
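If you only want to try the setting out without editing the config file, you can also pass it directly on the command line via --conf:

./bin/spark-shell --master yarn-client --conf spark.yarn.jar=hdfs://NAMENODE:PORT/apps/spark-assembly-1.3.0-hadoop2.4.0.jar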

That's it! You can now run, for example, the spark-shell

./bin/spark-shell --master yarn-client

without the uploading step.
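If everything is set up correctly, the Uploading resource line from above disappears and the client should instead log something along the lines of

INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://NAMENODE:8020/apps/spark-assembly-1.3.0-hadoop2.4.0.jar

since the assembly already lives in HDFS.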
