Migrating code from Zeppelin to Spark

When you have a shiny Zeppelin application that runs smoothly and does what it is supposed to do, you start moving the code into a Spark environment to use it in production. If you are a novice in the Hadoop environment (like me), you might run into a couple of tasks that have to be solved before you can celebrate the project launch.

Basically, the migration can be broken down into easy chunks:

  1. Launching spark-submit with a test class.
  2. Adding a main class and Spark context initialization.
  3. Building a fat jar (which includes all the libraries).
  4. Launching the job with spark-submit.

Running the bundled example should not depend on your environment, so we can rely on the HDP documentation:

root@sandbox# export SPARK_MAJOR_VERSION=2 
root@sandbox# cd /usr/hdp/current/spark2-client 
root@sandbox spark2-client# su spark 
spark@sandbox spark2-client$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/jars/spark-examples*.jar 10

If you see output after a while, it means that at least your cluster is running smoothly. It is also better to export the SPARK_MAJOR_VERSION variable in your .bash_profile or .bashrc file, so that it is set in every new shell.

As the official documentation suggests, you can build a fat jar with sbt. But make sure that the Spark context gets initialized inside the main function. Otherwise, the job may fail with this error:

Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
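
In Zeppelin the Spark context is created for you, so a standalone job has to create its own. Here is a minimal sketch of what such a main class might look like, assuming Spark 2.x. The object name ZeppelinJob, the application name, and the placeholder computation are all hypothetical; your notebook logic goes in their place.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name; this is what you would pass to spark-submit via --class.
object ZeppelinJob {
  def main(args: Array[String]): Unit = {
    // Create the session inside main, not in the object body.
    // No .master(...) call here, so the master URL is taken from
    // the --master option of spark-submit.
    val spark = SparkSession.builder()
      .appName("zeppelin-job") // hypothetical application name
      .getOrCreate()

    // Zeppelin provides `sc` out of the box; here we get it from the session.
    val sc = spark.sparkContext

    // Placeholder for the logic moved over from the notebook.
    val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
    println(s"Even numbers: $evens")

    spark.stop()
  }
}
```

Leaving the master out of the code keeps the same jar usable both for a local test run and on YARN, since the value is chosen at launch time.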

If you get any sort of Permission denied error, it can usually be solved by switching to another HDFS user (this topic might help).
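
Coming back to the fat jar step: a plain sbt package does not bundle your dependencies, so one common option is the sbt-assembly plugin. The build.sbt below is only a sketch under assumptions. The project name, the main class ZeppelinJob, and all version numbers are illustrative and should be matched to your cluster.

```scala
// build.sbt (sketch; all names and versions below are assumptions)
//
// The plugin itself is declared in a separate file, project/plugins.sbt:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

name := "zeppelin-job"      // hypothetical project name
version := "0.1.0"
scalaVersion := "2.11.12"   // should match the Scala version of your Spark build

// "provided": the cluster already ships Spark, so it stays out of the fat jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"

// Main class recorded in the jar manifest (hypothetical name).
assembly / mainClass := Some("ZeppelinJob")
```

With this in place, running sbt assembly should produce a single jar under the target directory, which you can pass to spark-submit together with --class in the same way as the examples jar above. If the build stops on duplicate-file conflicts, you may also need to configure a merge strategy for the plugin.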