28.03.2018 by Kirill on Technical

Overcoming long Spark job runtime on small datasets

If you are dealing with relatively small datasets (under 1M entries) and have to use Spark for some reason, you can get a significant speedup by tuning (lowering) the number of partitions.

Basically, setting the `spark.default.parallelism` param to the number of cores and `spark.sql.shuffle.partitions` to something like 20 (instead of the default 200) gives a significant speedup, since Spark won't lose time shuffling RDDs across many tiny partitions and generating a large number of tasks.
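As a rough illustration of the two settings above, here is a minimal Python sketch that turns them into `--conf` flags for `spark-submit`. The helper names `small_dataset_conf` and `to_submit_flags` are made up for this example, and reading the core count from the local machine is an assumption; on a cluster you would pass the number of cores available to the job instead.

```python
import os


def small_dataset_conf(cores=None, shuffle_partitions=20):
    """Build the two Spark settings discussed above (hypothetical helper)."""
    if cores is None:
        # Assumption: the job runs on this machine, so use its core count.
        cores = os.cpu_count() or 1
    return {
        "spark.default.parallelism": str(cores),
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }


def to_submit_flags(conf):
    """Turn a settings dict into spark-submit style `--conf key=value` flags."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


conf = small_dataset_conf(cores=4)
print(" ".join(to_submit_flags(conf)))
# prints: --conf spark.default.parallelism=4 --conf spark.sql.shuffle.partitions=20
```

The same key/value pairs can also go into `spark-defaults.conf` or be set on the session builder; which place is most convenient depends on how the job is launched.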

Source.

Another useful link.

10.03.2018 by Kirill on Technical

Migrating code from Zeppelin to Spark

When you have a shiny Zeppelin application that runs smoothly and does what it is supposed to do, you start transferring your code into a Spark environment to use it in production. If you are a novice in the Hadoop environment (like me), you might encounter a couple of tasks that have to be solved before you can celebrate the project launch.

Basically, it can be broken down into easy chunks:

Launching spark-submit with test class.
Adding main class and Spark context initialization.
Building fat jar (which includes all the libraries).
Launching a job with spark-submit.
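The launch steps in this list can be sketched as a small Python helper that assembles a `spark-submit` command line. The class name `com.example.Main`, the jar path `target/app-assembly.jar`, and the `local[*]` master are placeholders invented for the example, not values from the actual project.

```python
import shlex


def spark_submit_command(main_class, jar_path, master="local[*]", app_args=()):
    """Assemble a spark-submit invocation as an argument list (hypothetical helper)."""
    return [
        "spark-submit",
        "--class", main_class,   # entry point: the object holding the main method
        "--master", master,      # assumption: local mode; use your cluster manager in production
        jar_path,                # the fat jar that bundles all libraries
        *app_args,               # arguments passed through to the application
    ]


cmd = spark_submit_command("com.example.Main", "target/app-assembly.jar")
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
# To actually launch it (requires Spark on the PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when the jar path or arguments contain spaces.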

… →

13.12.2015 by Kirill on Uncategorized

Mailbox alternatives, or the new best mail client for iOS

On December 7, Dropbox announced that it is ending support for and development of two side projects: Mailbox and Carousel.

Mailbox was and remains the best mail client, one in which reading and browsing messages was very convenient. Swiping right and left to mark a message as read, snooze it, or file it into a separate folder appeared there among the first and was later copied by most other apps. The reasons for the shutdown have already been covered in detail (posts 1, 2) on vc.ru (formerly ЦП). But one question remains open: what can replace a client that so many people came to love?

… →