They say Spark is fast. How do I make the most of it?
I’ve written a lot of Spark jobs over the past few years. Some of my old data pipelines are probably still running as you’re reading this. However, my journey with Spark came with massive pain. You’ve probably felt it too.
- You run 3 big jobs with the same DataFrame, so you try to cache it - but then you look in the UI and it’s nowhere to be found (the sketch after this list shows why).
- You’re finally given the cluster you’ve been asking for… and then you’re like “OK, now how many executors do I pick?”.
- You have a simple job with 1GB of data that takes 5 minutes for 1149 tasks… and 3 hours on the last task.
- You have a big dataset and you know you’re supposed to partition it right, but you can’t pick a number between 2 and 50000 because you can find good reasons for both extremes!
- You search for “caching”, “serialization”, “partitioning”, “tuning” and you only find obscure blog posts and narrow StackOverflow questions.
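To take the caching example from the first bullet: cache() is lazy, so the DataFrame only appears in the Spark UI’s Storage tab after an action has actually materialized it. Here’s a minimal sketch of that behavior - the Parquet path and column name are placeholders, not from any real pipeline:

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo extends App {
  val spark = SparkSession.builder()
    .appName("Cache Demo")
    .master("local[*]")
    .getOrCreate()

  // placeholder path, for illustration only
  val df = spark.read.parquet("/path/to/some/data")

  // cache() is lazy: it only *marks* the DataFrame for caching.
  // Nothing shows up in the Storage tab of the Spark UI yet.
  val cached = df.cache()

  // Only after an action runs is the data materialized in memory
  // and visible in the UI's Storage tab.
  cached.count()

  // Subsequent jobs over the same DataFrame can reuse the cached blocks.
  cached.groupBy("someColumn").count().show() // "someColumn" is a placeholder

  spark.stop()
}
```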
Unless you have massive experience or you’re a Spark committer, you’re probably using 10% of Spark’s capabilities.
In the Spark Optimization course you learned how to write performant code. It’s time to kick into high gear and tune Spark to be the best it can be. This is the only course on the web that leverages Spark’s features and capabilities for top performance. With the techniques you learn here, you will save time, money, and energy, and spare yourself massive headaches.
Let’s rock.
In this course, we cut the weeds at the root. We dive deep into Spark and understand what tools you have at your disposal - and you might just be surprised at how much leverage you have. You will learn 20+ techniques for boosting Spark performance. Each of them can give at least a 2x performance boost for your jobs (some even 10x), and I demonstrate them on camera.