How to read and write data in Azure Blob Storage with Apache Spark

In this short article, we will write a Spark program in Scala that reads data from and writes data to Azure Blob Storage. In the code below, storageAccountName is the name of the Azure storage account, and storageKeyValue is the access key that authenticates your application when it makes requests to that storage account. The program connects to Azure Blob Storage using the account name, access key, and container name, reads a pipe-delimited CSV file into a DataFrame, and writes the DataFrame back to Azure Blob Storage as JSON.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ReadWriteDataFromAzureBlobWithSpark extends App {

    // Blob paths are relative to the container root.
    val input_blob_path = "Enter input azure blob path here"
    val output_blob_path = "Enter output azure blob path here"
    // Storage account name, its access key, and the container that holds the data.
    val storageAccountName = "Enter_Your_Storage_Account_Name"
    val storageKeyValue = "Enter_Your_Storage_Account_Key"
    val containerName = "Enter_Your_containerName"

    val sparkSession = SparkSession.builder()
      .master("local")
      .appName("ReadWriteDataFromAzureBlobWithSpark")
      .config(
        s"fs.azure.account.key.${storageAccountName}.blob.core.windows.net",
        storageKeyValue)
      .getOrCreate()

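    // Read the pipe-delimited CSV file, using the first row as the header
    // and letting Spark infer the column types.
    val df = sparkSession.read.option("delimiter", "|")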
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(s"wasbs://${containerName}@${storageAccountName}.blob.core.windows.net/${input_blob_path}")

    // Write the DataFrame back to the same container as JSON.
    df.write.json(s"wasbs://${containerName}@${storageAccountName}.blob.core.windows.net/${output_blob_path}")

}
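
The program above builds the account-key property name and the two wasbs:// URIs by hand with string interpolation, which is easy to mistype. As a minimal sketch, the same strings could be produced by a small helper. The object AzureBlobPaths and its two methods are hypothetical names introduced here for illustration; they are not part of Spark or of any Azure library, and the sketch assumes the default blob.core.windows.net endpoint used in the article.

```scala
// Hypothetical helper (not part of Spark or the Azure SDK): it only builds the
// strings that the program above assembles inline.
object AzureBlobPaths {

  // Assumption: the default public Azure Blob endpoint, as used in the article.
  private val BlobEndpointSuffix = "blob.core.windows.net"

  // Name of the configuration property that carries the storage account access key.
  def accountKeyProperty(storageAccountName: String): String =
    s"fs.azure.account.key.${storageAccountName}.${BlobEndpointSuffix}"

  // wasbs:// URI for a blob path inside a container. Leading slashes on the
  // path are dropped so the URI has exactly one "/" after the host name.
  def wasbsUri(containerName: String, storageAccountName: String, blobPath: String): String = {
    val relativePath = blobPath.dropWhile(_ == '/')
    s"wasbs://${containerName}@${storageAccountName}.${BlobEndpointSuffix}/${relativePath}"
  }
}
```

With such a helper, the read call would become sparkSession.read.csv(AzureBlobPaths.wasbsUri(containerName, storageAccountName, input_blob_path)), and the write call would use output_blob_path in the same way.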

 
