In a Databricks notebook, the SparkSession is created for you when you start a cluster with the Databricks Runtime. The session is accessible through a variable called spark and it encapsulates a SparkContext.
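For example, a notebook cell attached to a cluster can use that session straight away (an illustrative cell, not part of the pattern below):

spark.version          // Spark version shipped with the attached Databricks Runtime
spark.sparkContext     // the underlying SparkContext the session wraps
spark.range(3).show()  // ready to use, no builder code required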
In this article I will walk you through a pattern in Scala that lets a Spark application run both locally and in production on the Databricks Runtime. The idea is to reuse the SparkSession created by the Databricks Runtime in production deployments, while still being able to run the application in local mode for testing and debugging.
Let's create a trait Spark with a lazy val sparkSession, which is evaluated only when it is accessed for the first time.
import org.apache.spark.sql.SparkSession

trait Spark {
  lazy val sparkSession: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("Spark local app")
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=UTC")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
    .getOrCreate()
}
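Because the val is lazy, nothing is built until sparkSession is first accessed. As a rough illustration (a local runner of my own, not part of the pattern itself), any object that mixes in the trait gets a local session on demand:

object LocalSmokeTest extends Spark {
  def main(args: Array[String]): Unit = {
    // First access of sparkSession triggers the local[*] builder defined in the trait.
    sparkSession.range(5).toDF("id").show()
  }
}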
Let's define a Gateway class which will be the entry point for running the application both locally and in production.
import org.apache.spark.sql.SparkSession

class Gateway extends Spark {
  def run(param1: String, param2: String, sparkSessionOpt: Option[SparkSession] = None): Unit = {
    // Fall back to the local session from the Spark trait when no session is passed in.
    val spark = sparkSessionOpt.getOrElse(sparkSession)
    val extractor = new DataExtractor
    extractor.extractData(param1, param2, spark)
  }
}

object Gateway {
  def main(args: Array[String]): Unit = {
    run("PARAM1", "PARAM2")
  }

  def run(param1: String, param2: String, sparkSessionOpt: Option[SparkSession] = None): Unit = {
    val gateway = new Gateway
    gateway.run(param1, param2, sparkSessionOpt)
  }
}
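The DataExtractor used above is not shown in this article; a minimal hypothetical sketch (name and logic assumed purely for illustration) could look like this:

import org.apache.spark.sql.{DataFrame, SparkSession}

class DataExtractor {
  // Hypothetical extraction logic: read a table whose name is built from the two parameters,
  // using whichever session (local or Databricks-provided) the Gateway resolved.
  def extractData(param1: String, param2: String, sparkSession: SparkSession): DataFrame = {
    sparkSession.table(s"${param1.toLowerCase}.${param2.toLowerCase}")
  }
}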
As we can see above, if the Gateway is triggered locally, the main method is executed. It calls the object's run method, which delegates to the Gateway class's run method with sparkSessionOpt left as None, so the SparkSession from the Spark trait, configured to run in local mode, is used.
If we are executing the same code in a Databricks notebook, we call it as shown below. Here we pass the spark variable created by the Databricks Runtime, and because sparkSessionOpt now holds a valid SparkSession, the sparkSession from the Spark trait is never used.
Gateway.run("PARAM1", "PARAM2", Some(spark))
That's a quick overview of a design pattern that lets the same Spark application run both locally and in a Databricks notebook.