spark read avro data from s3

In this article, I will demonstrate how to read and write Avro data in Spark from Amazon S3. We will load the data from S3 into a DataFrame and then write the same data back to S3 in Avro format.

If the project is built using Maven, below are the dependencies that need to be added:


<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.11.455</version>
    <scope>compile</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/com.databricks/spark-avro -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.1.1</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.1</version>
</dependency>

If we are deploying the code on EMR or on any other Hadoop distribution like Cloudera, we need to add the Databricks spark-avro jar to the classpath. We can also pass the jar to spark-submit using the --jars option, as shown below:


spark-submit --master yarn --deploy-mode client --class com.test.ReadWriteS3Avro --name "avro_s3" --executor-cores 2 --executor-memory 1g --jars /home/adarsh/spark-avro_2.11-4.0.0.jar test.jar
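Alternatively, instead of copying the jar onto the cluster, spark-submit can resolve it from Maven Central with the --packages option. A sketch of the same submission under that assumption, using the Maven coordinates from the dependency section:


spark-submit --master yarn --deploy-mode client --class com.test.ReadWriteS3Avro --name "avro_s3" --executor-cores 2 --executor-memory 1g --packages com.databricks:spark-avro_2.11:4.0.0 test.jar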

Below is the Spark code, which loads the data from S3 into a DataFrame and then writes the same data back to S3 in Avro format.


package com.test;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWriteS3Avro {

    public static void main(String[] args) {

        // Use the S3A filesystem and the version 2 output committer,
        // which avoids a slow rename-based commit when writing to S3.
        SparkSession spark = SparkSession.builder().master("local")
                .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
                .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
                .getOrCreate();

        // awsAccessKeyId and awsSecretAccessKey are placeholders;
        // replace them with your own credentials.
        Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load(
                "s3a://awsAccessKeyId:awsSecretAccessKey@bucketname/avrodata/devicetestdata.avro");

        // Write the same data back to the bucket in Avro format.
        df.write().format("com.databricks.spark.avro")
                .save("s3a://awsAccessKeyId:awsSecretAccessKey@bucketname/avrodata");

        df.show();
    }
}
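Embedding the access key and secret key in the s3a URI works, but it exposes the credentials in logs and in the Spark UI. A minimal sketch of the alternative, assuming the same bucket and paths as above (the class name ReadWriteS3AvroWithConf is made up for illustration), is to set the standard fs.s3a credential properties on the session instead:


package com.test;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWriteS3AvroWithConf {

    public static void main(String[] args) {

        // fs.s3a.access.key and fs.s3a.secret.key are the standard S3A
        // credential properties; the "spark.hadoop." prefix copies them
        // into the Hadoop configuration used by the S3A filesystem.
        SparkSession spark = SparkSession.builder().master("local")
                .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
                .config("spark.hadoop.fs.s3a.access.key", "awsAccessKeyId")
                .config("spark.hadoop.fs.s3a.secret.key", "awsSecretAccessKey")
                .getOrCreate();

        // With credentials supplied via configuration, the URI only
        // needs the bucket and key.
        Dataset<Row> df = spark.read().format("com.databricks.spark.avro")
                .load("s3a://bucketname/avrodata/devicetestdata.avro");

        df.write().format("com.databricks.spark.avro")
                .save("s3a://bucketname/avrodata");

        df.show();
    }
}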
