In this short article I will show how to create a DataFrame or Dataset in Spark SQL from a Scala list.
In Scala we can use tuple objects to simulate the row structure, as long as the number of columns is less than or equal to 22 (Scala provides Tuple1 through Tuple22). In our example each row has 4 columns, so we will use the Tuple4 class. Below is an example:
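To make the tuple-as-row idea concrete, here is a minimal, Spark-free sketch showing that a Tuple4 carries exactly four positional fields (the object and value names are illustrative, not from the article):

```scala
object TupleRowExample {
  // Each row is modeled as a Tuple4; Scala only defines Tuple1 through
  // Tuple22, which is why this approach caps out at 22 columns.
  val row: (String, String, String, Int) = ("Apollo", "1", "20200901", 1)

  def main(args: Array[String]): Unit = {
    println(row.productArity) // 4 fields per row
    println(row._1)           // first field: the hospital name
  }
}
```

For wider rows (or simply for readability), a case class is the usual alternative, since its fields are named rather than positional.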
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.collection.mutable.ListBuffer

class SparkDataSetFromList {

  def getSampleDataFrameFromList(sparkSession: SparkSession): DataFrame = {
    import sparkSession.implicits._

    // Each Tuple4 models one row: (Hospital, AccountNumber, date, Visit)
    val sequenceOfOverview = ListBuffer[(String, String, String, Integer)]()
    sequenceOfOverview += Tuple4("Apollo", "1", "20200901", 1)
    sequenceOfOverview += Tuple4("Apollo", "2", "20200901", 0)
    sequenceOfOverview += Tuple4("Apollo", "3", "20200901", 1)
    sequenceOfOverview += Tuple4("Apollo", "4", "20200901", 0)
    sequenceOfOverview += Tuple4("Apollo", "1", "20200902", 1)
    sequenceOfOverview += Tuple4("Apollo", "2", "20200902", 0)
    sequenceOfOverview += Tuple4("Apollo", "3", "20200902", 1)
    sequenceOfOverview += Tuple4("Apollo", "4", "20200902", 1)
    sequenceOfOverview += Tuple4("Apollo", "1", "20200903", 0)
    sequenceOfOverview += Tuple4("Apollo", "2", "20200903", 0)
    sequenceOfOverview += Tuple4("Apollo", "3", "20200903", 0)
    sequenceOfOverview += Tuple4("Apollo", "4", "20200903", 1)
    sequenceOfOverview += Tuple4("Apollo", "1", "20200904", 0)
    sequenceOfOverview += Tuple4("Apollo", "2", "20200904", 0)
    sequenceOfOverview += Tuple4("Apollo", "3", "20200904", 1)
    sequenceOfOverview += Tuple4("Apollo", "4", "20200904", 1)

    // toDF assigns the column names positionally to the tuple fields
    val df1 = sequenceOfOverview.toDF("Hospital", "AccountNumber", "date", "Visit")
    df1
  }
}
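If the rows are known up front, the mutable ListBuffer is not strictly needed: an immutable Seq of tuple literals converts to a DataFrame the same way. A minimal sketch (the object and method names here are illustrative, and only two sample rows are shown):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SeqToDataFrameExample {

  def buildDf(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Tuple literals are equivalent to Tuple4(...) calls
    Seq(
      ("Apollo", "1", "20200901", 1),
      ("Apollo", "2", "20200901", 0)
    ).toDF("Hospital", "AccountNumber", "date", "Visit")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    buildDf(spark).show()
    spark.stop()
  }
}
```

Note that with plain Scala Int (instead of java.lang.Integer) the resulting Visit column is non-nullable, which is usually what you want for sample data.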
Let's print the schema and the values to examine the result:
import org.apache.spark.sql.SparkSession

object Gateway extends App {

  lazy val sparkSession: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .getOrCreate()

  val sparkDataSetFromList = new SparkDataSetFromList()
  val dataframe = sparkDataSetFromList.getSampleDataFrameFromList(sparkSession)
  dataframe.printSchema()
  dataframe.show()
}
Below is the result:
root
 |-- Hospital: string (nullable = true)
 |-- AccountNumber: string (nullable = true)
 |-- date: string (nullable = true)
 |-- Visit: integer (nullable = true)
+--------+-------------+--------+-----+
|Hospital|AccountNumber|    date|Visit|
+--------+-------------+--------+-----+
|  Apollo|            1|20200901|    1|
|  Apollo|            2|20200901|    0|
|  Apollo|            3|20200901|    1|
|  Apollo|            4|20200901|    0|
|  Apollo|            1|20200902|    1|
|  Apollo|            2|20200902|    0|
|  Apollo|            3|20200902|    1|
|  Apollo|            4|20200902|    1|
|  Apollo|            1|20200903|    0|
|  Apollo|            2|20200903|    0|
|  Apollo|            3|20200903|    0|
|  Apollo|            4|20200903|    1|
|  Apollo|            1|20200904|    0|
|  Apollo|            2|20200904|    0|
|  Apollo|            3|20200904|    1|
|  Apollo|            4|20200904|    1|
+--------+-------------+--------+-----+
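Since the title mentions both DataFrame and Dataset: a DataFrame like the one above can be turned into a typed Dataset by defining a case class whose field names match the column names and calling `.as[...]`. A minimal sketch (the case class and object names are illustrative, and only two sample rows are used):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class mirroring the four columns above;
// field names must match the DataFrame column names for .as[...] to work
case class HospitalVisit(Hospital: String, AccountNumber: String, date: String, Visit: Int)

object DataFrameToDatasetExample {

  def toDataset(spark: SparkSession): Dataset[HospitalVisit] = {
    import spark.implicits._

    val df = Seq(
      ("Apollo", "1", "20200901", 1),
      ("Apollo", "2", "20200901", 0)
    ).toDF("Hospital", "AccountNumber", "date", "Visit")

    // Typed view of the same data: rows become HospitalVisit instances
    df.as[HospitalVisit]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    toDataset(spark).show()
    spark.stop()
  }
}
```

The typed Dataset gives compile-time field access (e.g. `ds.map(_.Visit)`) instead of string-based column lookups.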
That's a brief overview of how to create a DataFrame or Dataset from a Scala list in Spark SQL.