They say Spark is fast. How do I make the most of it?
I’ve written a lot of Spark jobs over the past few years. Some of my old data pipelines are probably still running as you’re reading this. However, my journey with Spark came with massive pain. You’ve probably felt it too.
- You run 3 big jobs with the same DataFrame, so you try to cache it - but then you look in the UI and it’s nowhere to be found (the sketch after this list shows why).
- You’re finally given the cluster you’ve been asking for… and then you’re like “OK, now how many executors do I pick?”.
- You have a simple job with 1GB of data that takes 5 minutes for 1149 tasks… and 3 hours on the last task.
- You have a big dataset and you know you’re supposed to partition it right, but you can’t pick a number between 2 and 50000 because you can find good reasons for both extremes!
- You search for “caching”, “serialization”, “partitioning”, “tuning” and you only find obscure blog posts and narrow StackOverflow questions.
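To take the caching example from the first bullet: cache() is lazy, so the DataFrame only appears in the Spark UI’s Storage tab after an action has actually materialized it. Here’s a minimal sketch of that behavior - the Parquet path and column name are placeholders, not from any real pipeline:

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo extends App {
  val spark = SparkSession.builder()
    .appName("Cache Demo")
    .master("local[*]")
    .getOrCreate()

  // placeholder path, for illustration only
  val df = spark.read.parquet("/path/to/some/data")

  // cache() is lazy: it only *marks* the DataFrame for caching.
  // Nothing shows up in the Storage tab of the Spark UI yet.
  val cached = df.cache()

  // Only after an action runs is the data materialized in memory
  // and visible in the UI's Storage tab.
  cached.count()

  // Subsequent jobs over the same DataFrame can reuse the cached blocks.
  cached.groupBy("someColumn").count().show() // "someColumn" is a placeholder

  spark.stop()
}
```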
Unless you have massive experience or you’re a Spark committer, you’re probably using 10% of Spark’s capabilities.
In the Spark Optimization course you learned how to write performant code. It’s time to kick into high gear and tune Spark to be the best it can be. This is the only course on the web that leverages Spark’s features and capabilities for top performance. With the techniques you learn here, you will save time, money, and energy, and spare yourself massive headaches.
Let’s rock.
In this course, we cut the weeds at the root. We dive deep into Spark and understand what tools you have at your disposal - and you might just be surprised at how much leverage you have. You will learn 20+ techniques for boosting Spark performance. Each of them can give at least a 2x performance boost for your jobs (some even 10x), and I demonstrate them on camera.