Spark numeric RDD functions and examples – tutorial 12

Spark provides several descriptive statistics operations on RDDs containing numeric data. These operations are implemented with a streaming algorithm that builds up the result one element at a time, so all the descriptive statistics are computed in a single pass over the data. Calling stats() returns them as a StatCounter object.
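To make the "one element at a time" idea concrete, here is a minimal sketch of a single-pass accumulator in plain Java, using Welford-style updates. It is not Spark's source code: the class name RunningStats and its methods are hypothetical and chosen only for illustration. The merge() method shows how partial results from two partitions could be combined without revisiting the data.

```java
// Hypothetical illustration of a single-pass statistics accumulator.
// This is NOT Spark's StatCounter, only a sketch of the same idea.
final class RunningStats {
    private long n = 0;        // number of values seen so far
    private double mean = 0.0; // running mean
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    // Fold one value into the running statistics (Welford's update).
    RunningStats add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        return this;
    }

    // Combine with statistics gathered separately, e.g. on another partition.
    RunningStats merge(RunningStats other) {
        if (other.n == 0) {
            return this;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return this;
        }
        double delta = other.mean - mean;
        long total = n + other.n;
        mean += delta * other.n / total;
        m2 += other.m2 + delta * delta * n * other.n / total;
        n = total;
        return this;
    }

    long count() { return n; }

    double mean() { return mean; }

    // Population variance (divides by n).
    double variance() { return n == 0 ? Double.NaN : m2 / n; }

    public static void main(String[] args) {
        RunningStats left = new RunningStats().add(1).add(2);
        RunningStats right = new RunningStats().add(3).add(4);
        RunningStats all = left.merge(right);
        System.out.println(all.count());    // 4
        System.out.println(all.mean());     // 2.5
        System.out.println(all.variance()); // 1.25
    }
}
```

Because each update only needs the current count, mean, and sum of squared deviations, the data never has to be held in memory or scanned twice.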

The supported operations are count() for the number of elements in the RDD, mean() for the average of the elements, sum() for the total, max() for the maximum value, min() for the minimum value, variance() for the population variance of the elements, sampleVariance() for the variance computed for a sample (dividing by n - 1 instead of n), stdev() for the standard deviation, and sampleStdev() for the sample standard deviation.
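The difference between the population and sample variants is only the divisor. The plain-Java sketch below computes both for a small array so the two can be compared side by side. The class VarianceKinds and its method names are hypothetical and are not part of the Spark API.

```java
// Hypothetical helper contrasting population and sample variance.
// Not part of Spark; for illustration only.
final class VarianceKinds {

    // Sum of squared deviations from the mean.
    static double sumOfSquares(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        double mean = sum / xs.length;
        double ss = 0.0;
        for (double x : xs) {
            ss += (x - mean) * (x - mean);
        }
        return ss;
    }

    // Population variance: divide by n.
    static double populationVariance(double[] xs) {
        return sumOfSquares(xs) / xs.length;
    }

    // Sample variance: divide by n - 1 (needs at least two values).
    static double sampleVariance(double[] xs) {
        return sumOfSquares(xs) / (xs.length - 1);
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4};
        System.out.println(populationVariance(xs));            // 1.25
        System.out.println(sampleVariance(xs));                // about 1.667
        System.out.println(Math.sqrt(populationVariance(xs))); // population standard deviation
        System.out.println(Math.sqrt(sampleVariance(xs)));     // sample standard deviation
    }
}
```

For large RDDs the two values are nearly identical; the distinction matters mainly for small data sets.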

If you want to compute only one of these statistics, you can also call the corresponding method directly on the numeric RDD, for example rdd.mean() or rdd.sum().

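Below is an example in Java. It reads a comma-separated text file, parses the second field of each line as a double, and prints the statistics.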


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.DoubleFunction;
import org.apache.spark.util.StatCounter;

public class NumericRdd {

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");

        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Each input line is comma-separated; the second field holds the numeric value.
        JavaRDD<String> rdd = jsc.textFile("C:\\codebase\\scala-project\\inputdata\\movies_data_2");

        JavaDoubleRDD values = rdd.mapToDouble(new DoubleFunction<String>() {

            @Override
            public double call(String line) throws Exception {
                return Double.parseDouble(line.split(",")[1]);
            }
        });

        // stats() runs a job over the data, so call it once and reuse the result.
        StatCounter stats = values.stats();

        System.out.println(stats.count());
        System.out.println(stats.mean());
        System.out.println(stats.sum());
        System.out.println(stats.max());
        System.out.println(stats.min());
        System.out.println(stats.variance());
        System.out.println(stats.sampleStdev());
        System.out.println(stats.sampleVariance());
        System.out.println(stats.stdev());

        // A single statistic can also be computed directly on the RDD.
        System.out.println(values.max());

        // Release the Spark context once the job is finished.
        jsc.close();
    }

}