In this article I will demonstrate how to read and write Avro data in Spark from Amazon S3. We will load the data from S3 into a DataFrame and then write the same data back to S3 in Avro format.
If the project is built using Maven, the following dependencies need to be added:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.11.455</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.databricks/spark-avro -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.1</version>
</dependency>
If we are deploying the code on EMR or on any other Hadoop distribution such as Cloudera, we need to add the Databricks spark-avro jar to the classpath. We can also pass the jar to the job using the spark-submit command, as shown below:
spark-submit --master yarn --deploy-mode client --class com.test.ReadWriteS3Avro --name "avro_s3" --executor-cores 2 --executor-memory 1g --jars /home/adarsh/spark-avro_2.11-4.0.0.jar test.jar
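Alternatively, instead of shipping the jar from the local filesystem, spark-submit can resolve the Databricks Avro library from its Maven coordinates with the --packages option. This is a minimal sketch, assuming the cluster has outbound access to Maven Central and using the same class and application jar names as above:

spark-submit --master yarn --deploy-mode client --class com.test.ReadWriteS3Avro --name "avro_s3" --executor-cores 2 --executor-memory 1g --packages com.databricks:spark-avro_2.11:4.0.0 test.jar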
Below is the Spark code, which loads the data from S3 into a DataFrame and writes the data back to S3 in Avro format.
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWriteS3Avro {

    public static void main(String[] args) {

        // Create the SparkSession and configure the S3A filesystem.
        SparkSession spark = SparkSession.builder().master("local")
                .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
                .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
                .getOrCreate();

        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Load the Avro data from S3 into a DataFrame.
        Dataset<Row> dF = spark.read().format("com.databricks.spark.avro").load(
                "s3a://awsAccessKeyId:awsSecretAccessKey@bucketname/avrodata/devicetestdata.avro");

        // Write the DataFrame back to S3 in Avro format.
        dF.write().format("com.databricks.spark.avro")
                .save("s3a://awsAccessKeyId:awsSecretAccessKey@bucketname/avrodata");

        dF.show();
    }
}
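Embedding the access key and secret key in the s3a:// URI works, but the credentials then appear in logs and in the Spark UI. A safer variant is to set them on the Hadoop configuration instead. The sketch below is an assumption-based alternative, not part of the original example; the key values, bucket name, and output path are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWriteS3AvroWithConf {

    public static void main(String[] args) {

        SparkSession spark = SparkSession.builder().master("local")
                .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
                .getOrCreate();

        // Pass the credentials through the Hadoop configuration instead of the URI.
        // Placeholder values; supply your own keys or rely on instance/role credentials.
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", "awsAccessKeyId");
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", "awsSecretAccessKey");

        // Read the Avro file and write it back, as in the main example.
        Dataset<Row> df = spark.read().format("com.databricks.spark.avro")
                .load("s3a://bucketname/avrodata/devicetestdata.avro");
        df.write().format("com.databricks.spark.avro")
                .save("s3a://bucketname/avrodata-copy");

        df.show();
    }
}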