Performance issues fall into two categories:
1. Distribution Performance – the program is slow because of scheduling, coordination, and data distribution.
2. Local Performance – the program is slow because the code itself runs slowly on a single node.
Tools for debugging
1. Spark UI
Check which tasks take the longest, and look at the summary metrics for the stage in the Spark UI. If there is a large gap between the minimum and maximum task duration, the stage has a straggler.
2. Executor Logs
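The min/max comparison from the Spark UI summary metrics can also be done by hand once the task durations are copied out. The following is a minimal sketch in plain Python, not a Spark API: the function name find_stragglers, the sample durations, and the ratio threshold of 2.0 are all assumptions chosen for illustration.

```python
from statistics import median


def find_stragglers(durations_ms, ratio=2.0):
    """Return (task_index, duration_ms) for tasks slower than ratio * median.

    `ratio` is an illustrative threshold, not a Spark setting.
    """
    if not durations_ms:
        return []
    cutoff = ratio * median(durations_ms)
    return [(i, d) for i, d in enumerate(durations_ms) if d > cutoff]


# Hypothetical task durations for one stage, in milliseconds.
durations = [1000, 1100, 950, 1050, 9000]

# Same three numbers the summary metrics table shows: min, median, max.
print(min(durations), median(durations), max(durations))
print(find_stragglers(durations))
```

Comparing against the median rather than the minimum keeps a single unusually fast task from making every other task look slow.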
Stragglers can occur for the following reasons:
1. One of the nodes is slower than the others – To solve this, set the spark.speculation property to true. Spark then identifies slow tasks from the runtime distribution and relaunches them on other nodes.
2. Data skew – This happens when one partition holds much more data than the others. To solve this, spread that data across multiple partitions.
3. Garbage Collection – The Spark UI shows the GC time for each task. If GC accounts for most of a task's execution time, GC is the bottleneck.
4. The code that each task runs is itself slow.
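One common way to spread a skewed key across multiple partitions (reason 2 above) is key salting, which these notes do not name explicitly. The sketch below is plain Python rather than Spark code, and only illustrates the idea: partition_for is a stand-in for a hash partitioner, and the key names, the 8 partitions, and the 8 salt buckets are assumptions.

```python
import random
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Stand-in for a hash partitioner: stable hash of the key, modulo partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key, salt_buckets, rng):
    """Append a random salt so one hot key becomes several distinct keys."""
    return f"{key}#{rng.randrange(salt_buckets)}"


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical skewed data set: one hot key dominates.
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

plain = partition_sizes(keys, 8)
salted = partition_sizes([salted_key(k, 8, rng) for k in keys], 8)

print("largest partition without salt:", max(plain.values()))
print("largest partition with salt:   ", max(salted.values()))
```

In a real job the salt has to be removed again: aggregate by the salted key first, then strip the salt and aggregate the partial results by the original key.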