Spark Performance Tuning with Scala
Tune Apache Spark for best performance. Master Spark internals and configurations for maximum speed and memory efficiency for your cluster.
They say Spark is fast. How do I make the best out of it?
I've written a lot of Spark jobs over the past few years. Some of my old data pipelines are probably still running as you're reading this. However, my journey with Spark came with massive pain. You've probably seen this too.
- You run 3 big jobs with the same DataFrame, so you try to cache it - but then you look in the UI and it's nowhere to be found.
- You're finally given the cluster you've been asking for... and then you're like "OK, now how many executors do I pick?".
- You have a simple job with 1GB of data that takes 5 minutes for 1149 tasks... and 3 hours on the last task.
- You have a big dataset and you know you're supposed to partition it right, but you can't pick a number between 2 and 50000 because you can find good reasons for both!
- You search for "caching", "serialization", "partitioning", "tuning" and you only find obscure blog posts and narrow StackOverflow questions.
Unless you have some massive experience or you're a Spark committer, you're probably using 10% of Spark's capabilities.
In the Spark Optimization course you learned how to write performant code. It's time to kick into high gear and tune Spark for the best it can be. You are looking at the only course on the web which leverages Spark features and capabilities for the best performance. With the techniques you learn here you will save time, money, energy and massive headaches.
In this course, we cut the weeds at the root. We dive deep into Spark and understand what tools you have at your disposal - and you might just be surprised at how much leverage you have. You will learn 20+ techniques for boosting Spark performance. Each of them individually can give at least a 2x perf boost for your jobs (some of them even 10x), and I show it on camera.
What's in it for you:
- You'll understand Spark internals to explain how Spark is already pretty darn fast
- You'll be able to predict in advance if a job will take a long time
- You'll diagnose hanging jobs, stages and tasks
- You'll spot and fix data skews
- You'll make the right performance tradeoffs between speed, memory usage and fault-tolerance
- You'll be able to configure your cluster with the optimal resources
- You'll save hours of computation time in this course alone (let alone in prod!)
- You'll control the parallelism of your jobs with the right partitioning
And some extra perks:
- You'll have access to the entire code I write on camera (~1400 LOC)
- You'll be invited to our private Slack room where I'll share the latest updates, discounts, talks, conferences, and recruitment opportunities
- (soon) You'll have access to the takeaway slides
- (soon) You'll be able to download the videos for offline viewing
Skills you'll get:
Deep understanding of Spark internals so you can predict job performance
- stage & task decomposition
- reading query plans before jobs run
- reading DAGs while jobs are running
- performance differences between the different Spark APIs
- packaging and deploying a Spark app
- configuring Spark in 3 different ways
- understanding the state of the art in Spark internals
- leveraging Catalyst and Tungsten for massive perf
Understanding Spark Memory, Caching and Checkpointing
- tuning Spark executor memory zones
- caching for speedy data reuse
- making the right tradeoffs between speed, memory usage and fault tolerance
- using checkpoints when jobs are failing or you can't afford a recomputation
- leveraging repartitions
- using coalesce to avoid shuffles
- picking the right number of partitions at a shuffle to match cluster capability
- using custom partitioners for custom jobs
Cluster tuning, fixing problems
- allocating the right resources in a cluster
- fixing data skews and straggling tasks with salting
- fixing serialization problems
- using the right serializers for free perf improvements
This course is for Scala and Spark programmers who need to improve the run time and memory footprint of their jobs. If you've never done Scala or Spark, this course is not for you. I generally recommend that you take the Spark Optimization course first, but it's not a requirement.
I'm a software engineer and the founder of Rock the JVM. I started the Rock the JVM project out of love for Scala and the technologies it powers - they are all amazing tools and I want to share as much of my experience with them as I can.
As of February 2024, I've taught Java, Scala, Kotlin and related tech (e.g. Cats, ZIO, Spark) to 100000+ students at various levels and I've held live training sessions for some of the best companies in the industry, including Adobe and Apple. I've also taught university students who now work at Google and Facebook (among others), I've held Hour of Code for 7-year-olds and I've taught more than 35000 kids to code.
I have a Master's Degree in Computer Science and I wrote my Bachelor's and Master's theses on Quantum Computation. Before starting to learn programming, I won medals at international Physics competitions.
Get started now!
Take the proven path.
As with the other Rock the JVM courses, the Spark Performance Tuning course will take you through a battle-tested path to Spark proficiency as a data scientist and engineer.
As always, I've
- deconstructed the complexity of Spark into bite-sized chunks that you can practice in isolation
- selected the essential concepts and exercises with the appropriate complexity
- sequenced the topics in increasing order of difficulty so that they "click" along the way
- applied everything in live code
The value of this course is in showing you different techniques with their direct and immediate effect, so you can later apply them in your own projects. Although the concepts here are sequenced, you might need some particular techniques first, and that's fine. You can also treat this course as a buffet of techniques, and when you need them, just come back here.
Risk-free: 100% money back guarantee.
If you're not happy with this course, I want you to have your money back. If that happens, email me at [email protected] with a copy of your welcome email and I will refund you for the course.
Fewer than 0.3% of students have requested a refund for a course on the entire site, and every payment was returned in less than 72 hours.