In this short article I will show how to create a DataFrame/Dataset in Spark SQL from a Scala list.
In Scala we can use tuples to represent the row structure, as long as the number of columns is less than or equal to 22, since Scala's built-in tuple classes go up to Tuple22. In our example we want to create a DataFrame with 4 columns, so we will use the Tuple4 class. Below is an example:
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.mutable.ListBuffer

class SparkDataSetFromList {
  def getSampleDataFrameFromList(sparkSession: SparkSession): DataFrame = {
    // Brings toDF and the tuple encoders into scope
    import sparkSession.implicits._

    // Each Tuple4 is one row: (Hospital, AccountNumber, date, Visit).
    // The boxed Integer keeps the Visit column nullable in the schema.
    val sequenceOfOverview = ListBuffer[(String, String, String, Integer)]()
    sequenceOfOverview += Tuple4("Apollo", "1", "20200901", 1)
    sequenceOfOverview += Tuple4("Apollo", "2", "20200901", 0)
    sequenceOfOverview += Tuple4("Apollo", "3", "20200901", 1)
    sequenceOfOverview += Tuple4("Apollo", "4", "20200901", 0)
    sequenceOfOverview += Tuple4("Apollo", "1", "20200902", 1)
    sequenceOfOverview += Tuple4("Apollo", "2", "20200902", 0)
    sequenceOfOverview += Tuple4("Apollo", "3", "20200902", 1)
    sequenceOfOverview += Tuple4("Apollo", "4", "20200902", 1)
    sequenceOfOverview += Tuple4("Apollo", "1", "20200903", 0)
    sequenceOfOverview += Tuple4("Apollo", "2", "20200903", 0)
    sequenceOfOverview += Tuple4("Apollo", "3", "20200903", 0)
    sequenceOfOverview += Tuple4("Apollo", "4", "20200903", 1)
    sequenceOfOverview += Tuple4("Apollo", "1", "20200904", 0)
    sequenceOfOverview += Tuple4("Apollo", "2", "20200904", 0)
    sequenceOfOverview += Tuple4("Apollo", "3", "20200904", 1)
    sequenceOfOverview += Tuple4("Apollo", "4", "20200904", 1)

    // Name the columns while converting the list to a DataFrame
    val df1 = sequenceOfOverview.toDF("Hospital", "AccountNumber", "date", "Visit")
    df1
  }
}
Let's print the schema and the values to examine the result:
import org.apache.spark.sql.SparkSession

object Gateway extends App {
  // Use the class defined above
  val sparkDataSetFromList = new SparkDataSetFromList()

  // Local session that uses all available cores
  lazy val sparkSession: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .getOrCreate()

  val dataframe = sparkDataSetFromList.getSampleDataFrameFromList(sparkSession)
  dataframe.printSchema()
  dataframe.show()
}
Below is the result
root
 |-- Hospital: string (nullable = true)
 |-- AccountNumber: string (nullable = true)
 |-- date: string (nullable = true)
 |-- Visit: integer (nullable = true)

+--------+-------------+--------+-----+
|Hospital|AccountNumber|    date|Visit|
+--------+-------------+--------+-----+
|  Apollo|            1|20200901|    1|
|  Apollo|            2|20200901|    0|
|  Apollo|            3|20200901|    1|
|  Apollo|            4|20200901|    0|
|  Apollo|            1|20200902|    1|
|  Apollo|            2|20200902|    0|
|  Apollo|            3|20200902|    1|
|  Apollo|            4|20200902|    1|
|  Apollo|            1|20200903|    0|
|  Apollo|            2|20200903|    0|
|  Apollo|            3|20200903|    0|
|  Apollo|            4|20200903|    1|
|  Apollo|            1|20200904|    0|
|  Apollo|            2|20200904|    0|
|  Apollo|            3|20200904|    1|
|  Apollo|            4|20200904|    1|
+--------+-------------+--------+-----+
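The tuple-based example returns an untyped DataFrame. If a typed Dataset is preferred, one possible variation is to describe the row with a case class and call toDS() on an immutable Seq. The sketch below is only an illustration: the names HospitalVisit, SparkDataSetFromCaseClass and getSampleDataset are hypothetical and not part of the original example, and only the first four rows are repeated.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class for illustration; the field names mirror the
// column names used in the tuple example. It is declared at the top level
// so that Spark can derive an encoder for it.
case class HospitalVisit(Hospital: String, AccountNumber: String, date: String, Visit: Int)

object SparkDataSetFromCaseClass {
  def getSampleDataset(sparkSession: SparkSession): Dataset[HospitalVisit] = {
    // Brings toDS and the case class encoder into scope
    import sparkSession.implicits._

    // An immutable Seq is enough here; no ListBuffer is needed
    Seq(
      HospitalVisit("Apollo", "1", "20200901", 1),
      HospitalVisit("Apollo", "2", "20200901", 0),
      HospitalVisit("Apollo", "3", "20200901", 1),
      HospitalVisit("Apollo", "4", "20200901", 0)
    ).toDS()
  }
}
```

With a Dataset, rows can be filtered with plain Scala field access, for example getSampleDataset(sparkSession).filter(_.Visit == 1), and calling toDF() on the result gives back a DataFrame with the same column names.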
That’s a brief look at how we can create a DataFrame/Dataset from a Scala list in Spark SQL.