Overcoming long Spark job runtime on small datasets

If you are dealing with a relatively small dataset (< 1M entries) and you just have to use Spark for some reason, you can achieve a significant speedup by tuning (lowering) the number of partitions.

Basically, setting the `spark.default.parallelism` parameter to the number of cores and `spark.sql.shuffle.partitions` to something like 20 (instead of the default 200) can give you a significant speedup, since Spark won’t waste time shuffling RDDs and spawning a large number of tasks.
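A minimal PySpark sketch of what this looks like in practice. The core count (8), the shuffle partition count (20), the input path, and the column name are all illustrative assumptions, adjust them to your own setup:

```python
from pyspark.sql import SparkSession

# Lower partition-related settings for a small (< 1M rows) dataset.
# Values are illustrative: set parallelism to roughly your core count
# and shuffle partitions well below the default of 200.
spark = (
    SparkSession.builder
    .appName("small-dataset-job")
    .config("spark.default.parallelism", "8")      # ~ number of available cores
    .config("spark.sql.shuffle.partitions", "20")  # instead of the default 200
    .getOrCreate()
)

df = spark.read.parquet("path/to/small_dataset.parquet")  # hypothetical input
result = df.groupBy("some_column").count()  # this shuffle now yields ~20 partitions
result.show()
```

Note that `spark.sql.shuffle.partitions` affects DataFrame/SQL shuffles, while `spark.default.parallelism` applies to RDD operations, so for full coverage it makes sense to set both.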

Source.

Another useful link.


Kirill