Spark Performance Tuning with Scala

Tune Apache Spark for peak performance. Master Spark internals and configurations for maximum speed and memory efficiency on your cluster.

 

They say Spark is fast. So how do I get the most out of it?

I've written a lot of Spark jobs over the past few years. Some of my old data pipelines are probably still running as you're reading this. However, my journey with Spark came with massive pain. You've probably seen this too:

  • You run 3 big jobs with the same DataFrame, so you try to cache it - but then you look in the UI and it's nowhere to be found (there's a taste of this gotcha in the sketch right after this list).
  • You're finally given the cluster you've been asking for... and then you're like "OK, now how many executors do I pick?".
  • You have a simple job with 1GB of data that takes 5 minutes for 1149 tasks... and 3 hours on the last one.
  • You have a big dataset and you know you're supposed to partition it right, but you can't pick a number between 2 and 50000 because you can find good arguments for both extremes!
  • You search for "caching", "serialization", "partitioning", "tuning" and you only find obscure blog posts and narrow StackOverflow questions.

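About that caching gotcha: caching in Spark is lazy, so a DataFrame you cache won't show up in the Spark UI's Storage tab until an action actually materializes it. Here's a minimal sketch (the session setup and data path are illustrative, not from the course):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("Caching Gotcha")
      .master("local[*]")
      .getOrCreate()

    // hypothetical dataset path, for illustration only
    val df = spark.read.json("src/main/resources/data/some_dataset")

    df.persist(StorageLevel.MEMORY_AND_DISK) // lazy: nothing is cached yet
    // at this point the Storage tab in the Spark UI is still empty

    df.count() // an action forces evaluation; only now does df land in the cache
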
Unless you have some massive experience or you're a Spark committer, you're probably using 10% of Spark's capabilities.

In the Spark Optimization course you learned how to write performant code. It's time to kick into high gear and tune Spark to be the best it can be. You are looking at the only course on the web that leverages Spark's features and capabilities for the best performance. With the techniques you learn here, you will save time, money, energy and massive headaches.

Let's rock.

In this course, we cut the weeds at the root. We dive deep into Spark and understand what tools you have at your disposal - and you might just be surprised at how much leverage you have. You will learn 20+ techniques for boosting Spark performance. Each of them individually can give your jobs at least a 2x performance boost (some even 10x), and I demonstrate each one on camera.


What's in it for you:

  • You'll understand Spark internals to explain how Spark is already pretty darn fast
  • You'll be able to predict in advance if a job will take a long time
  • You'll diagnose hanging jobs, stages and tasks
  • You'll spot and fix data skews
  • You'll make the right performance tradeoffs between speed, memory usage and fault-tolerance
  • You'll be able to configure your cluster with the optimal resources
  • You'll save hours of computation time in this course alone (let alone in prod!)
  • You'll control the parallelism of your jobs with the right partitioning (there's a quick taste of this in the sketch after this list)

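Speaking of parallelism and partitioning, here's a minimal sketch of the knobs involved; the numbers are illustrative placeholders, and choosing them well for your data and cluster is exactly what the course teaches:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("Partitioning Sketch")
      .master("local[*]")
      // every shuffle (joins, groupBys, ...) will produce this many partitions
      .config("spark.sql.shuffle.partitions", "100")
      .getOrCreate()

    val df = spark.range(100000000L) // a synthetic 100M-row Dataset

    val repartitioned = df.repartition(40)    // full shuffle into exactly 40 partitions
    val narrowed = repartitioned.coalesce(10) // merges partitions without a shuffle
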
And some extra perks:

  • You'll have access to the entire code I write on camera (~1400 LOC)
  • You'll be invited to our private Slack room where I'll share the latest updates, discounts, talks, conferences, and recruitment opportunities
  • (soon) You'll have access to the takeaway slides
  • (soon) You'll be able to download the videos for offline viewing

Skills you'll get:

  • Deep understanding of Spark internals so you can predict job performance
    • stage & task decomposition
    • reading query plans before jobs run
    • reading DAGs while jobs are running
    • performance differences between the different Spark APIs
    • packaging and deploying a Spark app
    • configuring Spark in 3 different ways
    • understanding the state of the art in Spark internals
    • leveraging Catalyst and Tungsten for massive performance gains
  • Understanding Spark Memory, Caching and Checkpointing
    • tuning Spark executor memory zones
    • caching for speedy data reuse
    • making the right tradeoffs between speed, memory usage and fault tolerance
    • using checkpoints when jobs are failing or you can't afford a recomputation
  • Partitioning
    • leveraging repartitions
    • using coalesce to avoid shuffles
    • picking the right number of partitions at a shuffle to match cluster capability
    • using custom partitioners for custom jobs
  • Cluster tuning, fixing problems
    • allocating the right resources in a cluster
    • fixing data skews and straggling tasks with salting (see the sketch after this list)
    • fixing serialization problems
    • using the right serializers for free perf improvements
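
Salting deserves a quick preview, since it's the go-to fix for straggling tasks caused by hot keys: spread each hot key across several random "salt" buckets so no single task does all the work. A minimal sketch, assuming a hypothetical skewed events table joined against a small lookup table (paths and column names are made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("Salting Sketch")
      .master("local[*]")
      .getOrCreate()

    val events = spark.read.parquet("path/to/events") // hypothetical: columns key, payload
    val lookup = spark.read.parquet("path/to/lookup") // hypothetical: columns key, info

    val saltBuckets = 10 // illustrative; size it to the severity of the skew

    // scatter the hot keys: attach a random salt in [0, saltBuckets) to each event
    val saltedEvents = events.withColumn("salt", (rand() * saltBuckets).cast("int"))

    // replicate the small side once per salt value so every salted key finds a match
    val saltedLookup = lookup.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

    // join on (key, salt): the hot keys are now spread across saltBuckets tasks
    val joined = saltedEvents.join(saltedLookup, Seq("key", "salt")).drop("salt")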

This course is for Scala and Spark programmers who need to improve the run time and memory footprint of their jobs. If you've never used Scala or Spark, this course is not for you. I generally recommend taking the Spark Optimization course first, but it's not a requirement.

 

Your Instructor


Daniel Ciocîrlan

I'm a software engineer and the founder of Rock the JVM. I started the Rock the JVM project out of love for Scala and the technologies it powers - they are all amazing tools and I want to share as much of my experience with them as I can.

As of February 2024, I've taught Java, Scala, Kotlin and related tech (e.g. Cats, ZIO, Spark) to 100,000+ students at various levels, and I've held live training sessions for some of the best companies in the industry, including Adobe and Apple. I've also taught university students who now work at Google and Facebook (among others), I've run Hour of Code sessions for 7-year-olds, and I've taught more than 35,000 kids to code.

I have a Master's degree in Computer Science and I wrote my Bachelor's and Master's theses on Quantum Computation. Before I started programming, I won medals at international Physics competitions.


 

Get started now!



 

Take the proven path.

As with the other Rock the JVM courses, the Spark Performance Tuning course will take you through a battle-tested path to Spark proficiency as a data scientist and engineer.

As always, I've

  • deconstructed the complexity of Spark in bite-sized chunks that you can practice in isolation
  • selected the essential concepts and exercises with the appropriate complexity
  • sequenced the topics in increasing order of difficulty so that they "click" along the way
  • applied everything in live code

The value of this course is in showing you different techniques with their direct and immediate effects, so you can later apply them in your own projects. Although the topics here are sequenced, you might need a particular technique sooner - that's fine. You can also use this course as a buffet of techniques: when you need one, just come back here.

Risk-free: 100% money back guarantee.


If you're not happy with this course, I want you to have your money back. If that happens, email me at [email protected] with a copy of your welcome email and I will refund the course.

Fewer than 0.3% of students have refunded a course across the entire site, and every payment was returned in less than 72 hours.

Frequently Asked Questions


How long is the course? Will I have time for it?
The course is almost 8 hours long, with lessons usually 20-30 minutes each, and we write 1000-1500 lines of code. That's because 5-minute lectures and fill-in-the-blanks quizzes won't teach you the strategies that actually boost Spark's performance. For best results, watch the video lectures in 1-hour chunks.
What does a typical lesson contain?
Code is king, and we write it from scratch. In a typical lesson I'll explain some concepts briefly, then I'll dive right into the code. We'll write it together, either in the IDE or in the Spark Shell, and we'll test the effects of the code on either pre-loaded data (which I provide) or on bigger, generated data (whose generator I also provide). Sometimes we'll spend time in the Spark UI to understand what's going on. A few lectures are atypical in that we go through some thought exercises, but they're no less powerful.
Can I expense this at my company?
A wise company will spend some money on training their folks here rather than spending thousands (or millions) on computing power for nothing.
Is this hard?
There's a reason not everyone is a Spark pro. However, my job is to give you these (otherwise hard) topics in a way that will make you go like "huh, that wasn't so hard".
What if I'm not happy with the course?
If you're not 100% happy with the course, I want you to have your money back. It's a risk-free investment.
Daniel, I can't afford the course. What do I do?
For a while, I told everyone who could not afford a course to email me and I gave them discounts. But then I looked at the stats. Almost ALL the people who actually took the time and completed the course had paid for it in full. So I'm not offering discounts anymore. This is an investment in yourself, which will pay off 100x if you commit. If you find it didn't match your investment, I'll give you a refund.
I have very little Scala or Spark experience. Can I take this course?
Short answer: no. Long answer: we have two recap lessons at the beginning, but they're not a crash course in Scala or Spark, and they're not enough if this is the first time you're seeing them. You should take at least the Scala beginners course and the Spark Essentials course. I'd also recommend taking the Spark Optimization course first, but it's not a requirement - this course is standalone.
Why do I need to tune Spark?
Spark comes with a lot of performance tradeoffs that you will have to make while running your jobs. It's important to know what they are and how to use each configuration or setting, so that you can get the best performance out of your jobs.
What is Spark performance tuning anyway?
To get the optimal memory usage and speed out of your Spark job, you will need to know how Spark works. Tuning Spark means choosing the right configurations before running a job, the right resource allocation for your cluster, the right partitioning for your data, and much more.
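
As a small taste, here is one of the several ways to pass configurations: directly in code when building the session. The same keys can also go on the spark-submit command line or in spark-defaults.conf (and some, like executor memory, are best set at launch time). The values below are illustrative placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("Tuned Job")
      .config("spark.executor.memory", "4g")         // memory per executor
      .config("spark.executor.cores", "4")           // cores per executor
      .config("spark.sql.shuffle.partitions", "200") // parallelism after shuffles
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster serialization
      .getOrCreate()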