Spark read Avro file from HDFS example

December, 2017 adarsh

To load Avro data in Spark we need a few additional jars. The example below uses the spark-avro library from com.databricks. If the project is built with Maven, the following pom.xml can be used.


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.timepass.techies.spark</groupId>
<artifactId>spark</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>spark-project</name>
<description>spark-project</description>
<repositories>

<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>

</repositories>
<dependencies>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.0</version>
</dependency>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.0</version>
</dependency>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.6.0</version>
</dependency>

<!-- <dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>1.0.0</version> </dependency> -->

<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>1.1.0-cdh5.9.1</version>
</dependency>

</dependencies>

<build>
<plugins>
<!-- Maven shade plug-in that creates uber JARs -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>

Once all the dependent jars are available on the classpath, the LoadAvroData program below loads the Avro data using the Spark DataFrame API. If the Avro data file name does not have the .avro extension, we also need the following setting, which makes sure Spark does not skip files that lack the extension.


jsc.hadoopConfiguration().set("avro.mapred.ignore.inputs.without.extension", "false");
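Because the setting above lets Spark pick up files regardless of their name, it can help to confirm that an extension-less file really is Avro before loading it. The following is a minimal sketch, not part of the original example: it relies on the fact that Avro object container files start with the four bytes 'O', 'b', 'j', 1, and it reads a local copy of the file (for instance one fetched from HDFS beforehand). The class name AvroMagicCheck and its methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of the original post.
public class AvroMagicCheck {

    // Avro object container files start with the bytes 'O', 'b', 'j', 1.
    private static final byte[] AVRO_MAGIC = { 'O', 'b', 'j', 1 };

    /** Returns true if the first four bytes of header match the Avro magic. */
    public static boolean hasAvroMagic(byte[] header) {
        return header != null
                && header.length >= AVRO_MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, AVRO_MAGIC.length), AVRO_MAGIC);
    }

    /** Reads the first four bytes of a local file and checks them. */
    public static boolean looksLikeAvro(Path file) throws IOException {
        byte[] header = new byte[AVRO_MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    return false; // file is shorter than the magic header
                }
                read += n;
            }
        }
        return hasAvroMagic(header);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: AvroMagicCheck <local file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + " looks like Avro: " + looksLikeAvro(file));
    }
}
```

This only inspects the header, so it is a quick sanity check rather than a full validation of the file.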


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

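public class LoadAvroData {

    public static void main(String[] args) {

        // Fall back to a placeholder when no input path is passed on the command line.
        if (args.length == 0) {
            args = new String[] { "Input Path" };
        }

        SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");

        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Do not skip input files that lack the .avro extension.
        jsc.hadoopConfiguration().set("avro.mapred.ignore.inputs.without.extension", "false");

        SQLContext sql = new SQLContext(jsc);

        // Load the Avro data as a DataFrame through the com.databricks spark-avro data source.
        DataFrame uii_inventory = sql.read().format("com.databricks.spark.avro").load(args[0]);

        // Print the first 20 rows.
        uii_inventory.show();

        // Release the Spark context once we are done.
        jsc.stop();
    }

}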
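The program above uses the placeholder "Input Path". To read from HDFS, that value would typically be a fully qualified hdfs:// URI. The sketch below shows one way to build such a URI with the standard java.net.URI class. The class name HdfsInputPath, the host namenode.example.com, the port 8020 and the path /data/inventory are all assumptions made for illustration; use the NameNode address and path of your own cluster.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the original post.
public class HdfsInputPath {

    /** Builds an hdfs:// URI string from a NameNode host, port and absolute path. */
    public static String hdfsUri(String nameNodeHost, int port, String absolutePath) {
        if (absolutePath == null || !absolutePath.startsWith("/")) {
            throw new IllegalArgumentException("path must be absolute: " + absolutePath);
        }
        try {
            // scheme, userInfo, host, port, path, query, fragment
            return new URI("hdfs", null, nameNodeHost, port, absolutePath, null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("invalid HDFS location", e);
        }
    }

    public static void main(String[] args) {
        // Host, port and path are placeholders for illustration only.
        String input = hdfsUri("namenode.example.com", 8020, "/data/inventory");
        System.out.println(input); // prints hdfs://namenode.example.com:8020/data/inventory
    }
}
```

The resulting string can be passed as the first program argument, so that it reaches load(args[0]) in LoadAvroData.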



Copyright © 2017 Time Pass Techies