Tuning Spark often simply means changing the Spark application’s runtime configuration. The primary configuration mechanism in Spark is the SparkConf…
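As a minimal sketch of that mechanism, assuming a Scala application, a SparkConf is built in the driver and handed to the SparkContext; the app name, master URL, and memory value below are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Build a configuration in the driver; keys such as "spark.executor.memory"
// are ordinary string-valued settings.
val conf = new SparkConf()
  .setAppName("My Spark App")
  .setMaster("local[4]")               // illustrative: run locally with 4 threads
  .set("spark.executor.memory", "2g")  // illustrative runtime setting

val sc = new SparkContext(conf)        // the configuration is fixed once the context exists
```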
In distributed mode, Spark uses a master/slave architecture with one central coordinator and many distributed workers. The central coordinator is…
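To make the roles concrete, here is a rough sketch of a driver program; the standalone master URL is a placeholder, and the cluster manager it names is what launches the executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistributedApp {
  def main(args: Array[String]): Unit = {
    // The process running main() is the driver (the central coordinator).
    // "spark://masterhost:7077" is a placeholder cluster-manager URL.
    val conf = new SparkConf()
      .setAppName("DistributedApp")
      .setMaster("spark://masterhost:7077")
    val sc = new SparkContext(conf)

    // The tasks for this job run on the executors (the distributed workers),
    // not in the driver process itself.
    val total = sc.parallelize(1 to 1000).reduce(_ + _)
    println(total)
    sc.stop()
  }
}
```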
Spark provides several descriptive statistics operations on RDDs containing numeric data. Spark’s numeric operations are implemented with a streaming algorithm…
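For instance (a sketch assuming an existing SparkContext sc and made-up values), stats() returns the whole summary in a single pass, and that summary can then drive further transformations such as outlier removal:

```scala
// The values here are made up; stats() computes count, mean, stdev, max, and
// min in one streaming pass over the numeric RDD.
val nums = sc.parallelize(List(1.0, 2.0, 3.0, 4.0, 1000.0))
val summary = nums.stats()
println(summary.mean)
println(summary.stdev)

// Example follow-up: drop values more than three standard deviations from the mean.
val reasonable = nums.filter(x => math.abs(x - summary.mean) < 3 * summary.stdev)
```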
Working with data on a per-partition basis allows us to avoid redoing setup work for each data item.…
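A small sketch of the idea, assuming an existing SparkContext sc and a hypothetical, expensive-to-build parser:

```scala
// Stand-in for costly setup work such as opening a connection or compiling a pattern.
def createExpensiveParser(): String => Int =
  (s: String) => s.trim.toInt

val lines = sc.parallelize(List(" 1", "2 ", " 3 "), 2)

// With mapPartitions(), the parser is built once per partition and then reused
// for every element in that partition, instead of once per element as with map().
val parsed = lines.mapPartitions { iter =>
  val parse = createExpensiveParser()
  iter.map(parse)
}
```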
Normally, when we pass functions to Spark, such as a map() function or a condition for filter(), they can use…
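A brief sketch of that behavior, assuming an existing SparkContext sc; the query string and log lines are illustrative:

```scala
// query is defined in the driver program; the closure passed to filter()
// captures it, and each task receives its own copy of the variable.
val query = "error"
val logs = sc.parallelize(List("ok", "error: disk full", "error: timeout"))
val matches = logs.filter(line => line.contains(query))
println(matches.count())
```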
While Spark’s HashPartitioner and RangePartitioner are well suited to many use cases, Spark also allows you to tune how an…
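As one possible sketch of a custom partitioner (the URL keys and partition count are illustrative), a subclass of Partitioner needs numPartitions, getPartition(), and an equals() that lets Spark compare partitioners:

```scala
import org.apache.spark.Partitioner

// Group URL keys by host name so pages from the same domain land in the same partition.
class DomainNamePartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val domain = new java.net.URL(key.toString).getHost
    val code = domain.hashCode % numPartitions
    if (code < 0) code + numPartitions else code  // keep the partition index non-negative
  }

  // Lets Spark test whether two RDDs share the same partitioning.
  override def equals(other: Any): Boolean = other match {
    case p: DomainNamePartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}
```

It would then be applied with something like pairs.partitionBy(new DomainNamePartitioner(20)), where pairs is a pair RDD keyed by URL strings.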
Sometimes we want a different sort order entirely, and to support this we can provide our own comparison function. In…
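A quick sketch in Scala, assuming an existing SparkContext sc: sortByKey() picks up an implicit Ordering for the key type, so supplying our own changes the sort order (here, integer keys compared as strings):

```scala
// Custom ordering: compare integer keys by their string representation.
implicit val sortIntegersByString: Ordering[Int] = new Ordering[Int] {
  override def compare(a: Int, b: Int): Int = a.toString.compareTo(b.toString)
}

val pairs = sc.parallelize(List((10, "ten"), (2, "two"), (1, "one")))
// With the implicit in scope, keys sort as "1", "10", "2" rather than 1, 2, 10.
pairs.sortByKey().collect()
```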