Checkpointing is the main mechanism that needs to be set up for fault tolerance in Spark Streaming. It allows Spark…
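A minimal sketch of how checkpointing is typically wired up, assuming a local two-core master and a hypothetical checkpoint directory (any fault-tolerant store such as HDFS would work); `StreamingContext.getOrCreate` either recovers the context from the checkpoint data or builds a fresh one:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint location; in production this should be a
// fault-tolerant filesystem such as HDFS or S3.
val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointExample").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir) // enable metadata and RDD checkpointing
  // ... define DStream transformations here ...
  ssc
}

// Recover from checkpoint if it exists, otherwise create a new context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```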
Stateful transformations are operations on DStreams that track data across time; that is, some data from previous batches is used…
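As a sketch of a stateful transformation, `updateStateByKey` maintains a running count per key across batches (it requires checkpointing to be enabled, since the state itself must survive failures); the `words` DStream here is assumed to exist:

```scala
import org.apache.spark.streaming.dstream.DStream

// Running word count across all batches seen so far.
// newValues holds this batch's counts for a key; state holds the prior total.
def runningCount(words: DStream[String]): DStream[(String, Long)] = {
  val updateFunc = (newValues: Seq[Long], state: Option[Long]) =>
    Some(newValues.sum + state.getOrElse(0L))
  words.map(word => (word, 1L)).updateStateByKey(updateFunc)
}
```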
Stateless transformations such as map(), flatMap(), filter(), repartition(), reduceByKey(), and groupByKey() are simple RDD transformations applied independently to each batch. Keep in…
Spark Streaming provides an abstraction called DStreams, or discretized streams, which is built on top of RDDs. A DStream is…
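The RDD-per-batch nature of a DStream can be made visible with `foreachRDD`; this sketch assumes an existing SparkContext `sc` and a hypothetical socket source:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One-second batch interval: each interval yields one RDD in the stream.
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { (rdd, time) =>
  // Every batch is an ordinary RDD that supports the usual RDD operations.
  println(s"Batch at $time contains ${rdd.count()} records")
}
```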
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational…
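A minimal sketch of both styles of Dataset transformation, assuming a local SparkSession; the case class and data are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DatasetExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)
val people = Seq(Person("Ann", 34), Person("Bob", 29)).toDS()

// Functional (typed) transformation: the lambda sees Person objects.
val adults = people.filter(_.age >= 30)

// Relational (untyped) transformation: operates on named columns.
val names = people.select($"name")
```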
Unless otherwise configured via spark.sql.sources.default, Parquet is the default data source for all operations. We can use the…
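A sketch of how the default interacts with `load()`/`save()` and `format()`, assuming an existing SparkSession `spark`; the file paths are hypothetical:

```scala
// With no format specified, load()/save() use spark.sql.sources.default,
// which is parquet out of the box.
val users = spark.read.load("users.parquet")
users.select("name").write.save("names.parquet")

// Override the default for one read by naming the format explicitly.
val people = spark.read.format("json").load("people.json")

// Or change the session-wide default.
spark.conf.set("spark.sql.sources.default", "json")
```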
User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract class. Let's write a user-defined aggregation to calculate…
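As one concrete instance of the Aggregator pattern, here is a sketch of a typed average over a hypothetical Employee Dataset: `zero` defines the empty buffer, `reduce` folds one input into a buffer, `merge` combines partial buffers across partitions, and `finish` produces the final result:

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // Zero value for the intermediate buffer.
  def zero: Average = Average(0L, 0L)
  // Fold one input record into the buffer.
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two partial buffers (from different partitions).
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the final buffer into the output value.
  def finish(reduction: Average): Double =
    reduction.sum.toDouble / reduction.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```

It can then be applied to a Dataset as a column, e.g. `ds.select(MyAverage.toColumn.name("average_salary"))`.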