Reading an ORC file in Spark

We will use the hadoopFile method of the Spark context (via JavaSparkContext) to read the ORC file.

Below is the method signature, which returns an RDD for a Hadoop file with an arbitrary InputFormat:

<K,V> RDD<scala.Tuple2<K,V>> hadoopFile(String path, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)

Below is the Java code, which reads an ORC file and saves its contents in text file format.

import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class ORCReaderDriver {

public static void main(String[] args) {

SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");

args = new String[] { "ORC FILE INPUT PATH", "OUTPUT_PATH" };

JavaSparkContext jsc = new JavaSparkContext(sparkConf);

// Read the ORC file as a pair RDD of <NullWritable, OrcStruct>
JavaPairRDD<NullWritable, OrcStruct> orcSourceRdd = jsc.hadoopFile(args[0], OrcInputFormat.class, NullWritable.class, OrcStruct.class, 1);

// Convert each OrcStruct row into its string representation
JavaRDD<String> rowsAsText = orcSourceRdd.map(new Function<Tuple2<NullWritable, OrcStruct>, String>() {

private static final long serialVersionUID = 5454545;

public String call(Tuple2<NullWritable, OrcStruct> orcStruct) throws Exception {
OrcStruct struct = orcStruct._2();
return struct.toString();
}
});

// Save the rows in text file format
rowsAsText.saveAsTextFile(args[1]);

jsc.stop();
}
}
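As an aside, on Spark 2.x and later the same result can be obtained through Spark SQL's built-in ORC data source, which avoids the InputFormat plumbing entirely. The following is a minimal sketch, not part of the original post; it assumes a Spark 2.x runtime with spark-sql on the classpath, and the paths are the same placeholders used above.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ORCSQLReaderDriver {

public static void main(String[] args) {

// Assumes Spark 2.x or later; SparkSession replaces JavaSparkContext
SparkSession spark = SparkSession.builder().appName("test").master("local").getOrCreate();

// The ORC data source infers the schema from the file footer
Dataset<Row> orcData = spark.read().format("orc").load("ORC FILE INPUT PATH");

// Render each row as a string and save in text file format
orcData.toJavaRDD().map(Row::toString).saveAsTextFile("OUTPUT_PATH");

spark.stop();
}
}
```

The trade-off is that the data source API gives back typed Rows with a schema, whereas the hadoopFile approach hands you raw OrcStruct objects.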


