Checkpointing and Fault Tolerance in Spark Streaming

Checkpointing is the main mechanism that needs to be set up for fault tolerance in Spark Streaming. It allows Spark Streaming to periodically save data about the application to a reliable storage system, such as HDFS or Amazon S3, for use in recovery.

Specifically, checkpointing serves two purposes:

1. Limiting the state that must be recomputed on failure. Spark Streaming can recompute state using the lineage graph of transformations, but checkpointing controls how far back it must go.

2. Providing fault tolerance for the driver. If the driver program in a streaming application crashes, you can launch it again and tell it to recover from a checkpoint, in which case Spark Streaming will read how far the previous run of the program got in processing the data and take over from there.

For these reasons, checkpointing is important to set up in any production streaming application. You can enable it as shown below:


// Create a streaming context with a 5-second batch interval and point
// checkpointing at a directory (a local path here; use HDFS or S3 in production)
JavaStreamingContext jstream = new JavaStreamingContext(jsc, new Duration(5000));
jstream.checkpoint("C:\\codebase\\scala-project\\Checkdatastream\\movies");

Note that even in local mode, Spark Streaming will complain if you try to run a stateful operation without checkpointing enabled. In that case, you can pass a local filesystem path for checkpointing. But in any production setting, you should use a replicated system such as HDFS or S3.
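For reference, here is a minimal sketch of the kind of stateful operation that triggers this requirement: a running word count built with updateStateByKey. The wordPairs variable is a hypothetical JavaPairDStream of (word, 1) pairs produced by earlier transformations, and the Optional class assumed here is org.apache.spark.api.java.Optional from Spark 2.x.


import java.util.List;

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// updateStateByKey keeps a running count per key across batches, which is why
// Spark Streaming insists on a checkpoint directory being set.
// wordPairs is a hypothetical JavaPairDStream<String, Integer> of (word, 1) pairs.
JavaPairDStream<String, Integer> runningCounts = wordPairs.updateStateByKey(
        new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {

            @Override
            public Optional<Integer> call(List<Integer> newValues, Optional<Integer> state) {
                int sum = state.orElse(0);          // previous count, or 0 for a new key
                for (Integer value : newValues) {   // add the counts from the current batch
                    sum += value;
                }
                return Optional.of(sum);
            }
        });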

Driver Fault Tolerance

Tolerating failures of the driver node requires a special way of creating our StreamingContext, which takes in the checkpoint directory. Instead of simply calling new StreamingContext, we need to use the StreamingContext.getOrCreate() function (JavaStreamingContext.getOrCreate() in the Java API).


JavaStreamingContext jstream = JavaStreamingContext.getOrCreate(
        "C:\\codebase\\scala-project\\Checkdatastream\\movies",
        new Function0<JavaStreamingContext>() {

            private static final long serialVersionUID = 1343434;

            @Override
            public JavaStreamingContext call() throws Exception {
                // This factory is only invoked when no checkpoint exists yet.
                // sparkConf is a SparkConf assumed to be defined elsewhere in the application.
                JavaSparkContext jsc = new JavaSparkContext(sparkConf);
                JavaStreamingContext jstream = new JavaStreamingContext(jsc, new Duration(5000));
                jstream.checkpoint("C:\\codebase\\scala-project\\Checkdatastream\\movies");
                return jstream;
            }
        });

When this code is run the first time, assuming that the checkpoint directory does not yet exist, the StreamingContext will be created by the factory function passed to getOrCreate(); in that function, we set the checkpoint directory. After the driver fails, if you restart it and run this code again, getOrCreate() will reinitialize a StreamingContext from the checkpoint directory and resume processing.
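Whichever path is taken, the context returned by getOrCreate() is started the same way. A minimal sketch of the remainder of the driver program:


// Start the (newly created or recovered) streaming context and
// block until the computation terminates or is stopped
jstream.start();
jstream.awaitTermination();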

Worker Fault Tolerance

For failure of a worker node, Spark Streaming uses the same techniques as Spark for its fault tolerance. All the data received from external sources is replicated among the Spark workers. All RDDs created through transformations of this replicated input data are tolerant to failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.

Receiver Fault Tolerance

The fault tolerance of the workers running the receivers is another important consideration. In such a failure, Spark Streaming restarts the failed receivers on other nodes in the cluster. However, whether any of the received data is lost depends on the nature of the source (whether it can resend data) and the implementation of the receiver (whether it acknowledges received data back to the source or not).

In general, receivers provide the following guarantees:

1. All data read from a reliable filesystem (e.g., with StreamingContext.hadoopFiles) is reliable, because the underlying filesystem is replicated. Spark Streaming will remember which data it processed in its checkpoints and will pick up again where it left off if your application crashes.

2. For unreliable sources such as Kafka, push-based Flume, or Twitter, Spark replicates the input data to other nodes, but it can briefly lose data if a receiver task goes down.
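The degree of replication referred to in the second point is controlled by the storage level passed when the receiver-based input stream is created. Below is a minimal sketch using a socket source with placeholder host and port; MEMORY_AND_DISK_SER_2 stores each received block serialized and replicated on two workers, and is the default for receiver input streams.


import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;

// Receive text lines from a socket source (placeholder host and port) and keep
// each received block serialized and replicated on two workers, so the loss of
// a single worker does not lose buffered input that has not yet been processed.
JavaReceiverInputDStream<String> lines = jstream.socketTextStream(
        "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2());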

Processing Guarantees

Due to Spark Streaming’s worker fault-tolerance guarantees, it can provide exactly-once semantics for all transformations: even if a worker fails and some data gets reprocessed, the final transformed result (that is, the transformed RDDs) will be the same as if the data were processed exactly once.