Spark Optimization with Scala

Go fast or go home. Learn the ins and outs of Spark and get the best out of your code.

Why the $&*(# is my Spark job running so slow?

I've had my fair share of pain with Spark, and if you're reading this, you've probably seen it too: you run a 4-line job on a gig of data, with two innocent joins, and it takes a bloody hour to run. Or another one: you have an hour-long job which was progressing smoothly, until task 1149/1150, where it hangs, and after two more hours you decide to kill it because you don't know if it's you or a bug in Spark. Usually it's PEBKAC - problem exists between keyboard and chair - but in desperation, the only idea you have is to turn it off and on again.

Then you go like, "hm, maybe my Spark cluster is too small, let me bump some CPU and mem". Then... same thing. Amazon's probably laughing now, and you're paying for it. So this has to be the million-dollar question.

This is the only course on the web dedicated to teaching you how to optimize Spark jobs and master Spark optimization techniques. With the strategies you learn in this Spark optimization course, you will save yourself time, headaches and money.

Let's improve that Spark performance.

In this Spark optimization course, we cut the weeds at the root. We dive deep into Spark performance optimization, and you will learn how Spark works under the hood. We'll see that we have incredible leverage, IF we write intelligent code, and you will do exactly that. You will learn 20+ Spark optimization techniques and strategies. Each of them individually can give at least a 2x performance boost for your jobs (some of them even 10x), and I demonstrate each one on camera.


What you'll learn:

  • You'll understand Spark internals and how Spark works behind the scenes
  • You'll be able to predict in advance if a job will take a long time
  • You'll diagnose performance problems in the Spark UI
  • You'll write smart joins with no shuffles
  • You'll organize your data intelligently so expensive operations are no longer a problem
  • You'll use RDD capabilities for bespoke, high-performance jobs
  • You'll leverage the JVM for high-performance Spark jobs
  • You'll save hours of computation time in this course alone (let alone in prod!)

Plus some extra perks:

  • You'll have access to the entire code I write on camera (~1400 LOC)
  • You'll be invited to our private Slack room, where I'll share the latest updates, discounts, talks, conferences, and recruitment opportunities
  • (soon) You'll have access to the takeaway slides
  • (soon) You'll be able to download the videos for offline viewing

Skills you'll get:

  • Deep understanding of Spark internals so you can predict job performance
    • stage & task decomposition
    • reading query plans before jobs run
    • reading DAGs while jobs are running
    • performance differences between the different Spark APIs
    • packaging and deploying a Spark app
    • configuring Spark in 3 different ways
  • DataFrame and Spark SQL Optimizations
    • understanding join mechanics and why they are expensive
    • writing broadcast joins, or what to do when you join a large and a small DataFrame (see the sketch after this list)
    • writing pre-join optimizations: column pruning, pre-partitioning
    • bucketing for fast access
    • fixing data skews, "straggling" tasks and OOMs
  • Optimizing RDDs
    • using broadcast joins "manually"
    • cogrouping RDDs in multi-way joins
    • fixing data skews
    • writing optimizations that Spark doesn't generate for us
  • Optimizing key-value RDDs, as most useful transformations need them
    • using the different _byKey methods intelligently
    • reusing JVM objects when performance is critical and even a few seconds count
    • using the powerful iterator-to-iterator pattern for arbitrary efficient processing
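
To make one of the DataFrame techniques above concrete, here's a minimal sketch of a broadcast join - the "large and a small DataFrame" scenario from the list. The table names and data are made up for illustration; this is not the course code itself:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Broadcast Join Sketch")
          .master("local[*]") // local mode, just for illustration
          .getOrCreate()
        import spark.implicits._

        // a large "fact" dataset and a small lookup table (both invented here)
        val events = (1 to 1000000).toDF("userId")
        val userNames = Seq((1, "Alice"), (2, "Bob")).toDF("userId", "name")

        // broadcast() hints Spark to ship the small table to every executor,
        // so each partition of the large side joins locally - no shuffle
        val joined = events.join(broadcast(userNames), Seq("userId"))

        joined.explain() // the plan should show BroadcastHashJoin, not SortMergeJoin
        spark.stop()
      }
    }

The idea is to trade a full shuffle of the big table for a one-time copy of the small one - exactly the kind of leverage these techniques are about.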

This course is for Scala and Spark programmers who need to improve the run time of their jobs. If you've never done Scala or Spark, this course is not for you.

Your Instructor


Daniel Ciocîrlan

I'm a software engineer and the founder of Rock the JVM. I started the Rock the JVM project out of love for Scala and the technologies it powers - they are all amazing tools and I want to share as much of my experience with them as I can.

As of February 2024, I've taught Java, Scala, Kotlin and related tech (e.g. Cats, ZIO, Spark) to 100,000+ students at various levels, and I've held live training sessions for some of the best companies in the industry, including Adobe and Apple. I've also taught university students who now work at Google and Facebook (among others), I've held Hour of Code sessions for 7-year-olds, and I've taught more than 35,000 kids to code.

I have a Master's degree in Computer Science, and I wrote my Bachelor's and Master's theses on Quantum Computation. Before I started programming, I won medals at international Physics competitions.

Risk-free: 100% money back guarantee.


If you're not happy with this course, I want you to have your money back. If that happens, email me at [email protected] with a copy of your welcome email and I will refund your purchase.

Fewer than 0.3% of students have requested a refund across the entire site, and every payment was returned in less than 72 hours.

Take the proven path.

As with the other Rock the JVM courses, this Spark Optimization course will take you through a battle-tested path to Spark proficiency as a data scientist and engineer.

As always, I've

  • deconstructed the complexity of Spark in bite-sized chunks that you can practice in isolation
  • selected the essential concepts and exercises with the appropriate complexity
  • sequenced the topics in increasing order of difficulty so that they "click" along the way
  • applied everything in live code

The value of this course is in showing you different Spark performance optimization techniques with their direct and immediate effect, so you can later apply them in your own projects.

Frequently Asked Questions


How long is the course? Will I have time for it?
The course is a little over 9 hours long, with lessons of 20-30 minutes each, and we write 1000-1500 lines of code. For complex topics such as Spark optimization techniques, I don't believe in 5-minute lectures or fill-in-the-blanks quizzes. For best results, I recommend chunks of 1 hour of learning at a time.
What does a typical lesson contain?
Code is king, and we write it from scratch. In a typical lesson I'll explain some concepts briefly, then dive right into the code. We'll write it together, either in the IDE or in the Spark shell, and we'll test the effects of the code either on pre-loaded data (which I provide) or on bigger, generated data (whose generator I also provide). Sometimes we'll spend time in the Spark UI to understand what's going on.
Can I expense this at my company?
Of course! A wise company would rather spend some money on training its people than spend thousands (or millions) on computing power for nothing.
Is Spark optimization hard to learn?
There's a reason not everyone is a Spark performance pro. However, my job is to give you these (otherwise hard) topics in a way that will make you go like "huh, that wasn't so hard".
What if I'm not happy with the course?
If you're not 100% happy with the course, I want you to have your money back. It's a risk-free investment.
Daniel, I can't afford the course. What do I do?
For a while, I told everyone who could not afford a course to email me and I gave them discounts. But then I looked at the stats. Almost all the people who actually took the time and completed the course had paid for it in full. So I'm not offering discounts anymore. This is an investment in yourself, which will pay off 100x if you commit.
I have very little Scala or Spark experience. Can I take this course?
Short answer: no. Long answer: we have two recap lessons at the beginning, but they're not a crash course in Scala or Spark, and they're not enough if this is your first time seeing them. You should take at least the Scala beginners course and the Spark Essentials course first.
How do you optimize Spark jobs?
This is what I teach you here. We always start with understanding how Spark works, so that we know what we can optimize and what we can't. Depending on the size and structure of your data, the right technique to apply may differ. Most of the time, joins are the problem, and there are a few tools we can use: broadcast joins, pre-partitioning, bucketing or column pruning. If your data is skewed, that will disproportionately affect the runtime of your job, so we balance the data. The most powerful - but also most tedious - Spark optimization strategies involve a deep knowledge of RDDs, which I'll also show you in this course.
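
To illustrate the skew-balancing idea in the answer above, here's a sketch of the classic "salting" trick: split a hot join key into several artificial sub-keys so no single task receives the whole key. The column names, salt count and data are invented for the example, not taken from the course:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object SaltedJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Salted Join Sketch")
          .master("local[*]") // local mode, just for illustration
          .getOrCreate()
        import spark.implicits._

        val salts = 10 // sub-keys per hot key; tune to the skew you observe

        // imagine orders heavily skewed toward a few customerIds
        val orders = Seq((1, 99.0), (1, 12.5), (2, 5.0)).toDF("customerId", "amount")
        val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customerId", "name")

        // each order row gets a random salt in [0, salts)
        val saltedOrders = orders.withColumn("salt", (rand() * salts).cast("int"))

        // replicate each customer row once per salt value,
        // so every salted key on the big side finds a match
        val saltedCustomers = customers
          .withColumn("salt", explode(array((0 until salts).map(lit(_)): _*)))

        // joining on (customerId, salt) spreads a hot customerId
        // across `salts` tasks instead of one straggling task
        val joined = saltedOrders.join(saltedCustomers, Seq("customerId", "salt"))

        joined.show()
        spark.stop()
      }
    }

The trade-off is deliberate: the small side is replicated `salts` times in exchange for evenly sized tasks on the skewed side.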