Advanced Apache Spark for Data Engineers
Get a deeper understanding of Apache Spark in order to optimize your data workflow.

Description
In this course, you will explore techniques and best practices for optimizing Apache Spark applications. As you study the architectural elements of Spark, you will learn to work with the Spark UI. You will identify and address common performance issues caused by shuffles and skew. The course also covers advanced optimization strategies for join, union, and merge operations, as well as data formats, caching mechanisms, garbage collection settings, data partitioning, bucketing, and Delta Lake. Additionally, you will explore regular maintenance tasks for Spark applications and learn how to customize Spark session configurations for optimal performance.
Learning Goals
Prior Knowledge
- Python
- Apache Spark fundamentals
Subjects
- Introduction to Spark Architecture and Ecosystem
- Understanding the Spark UI
- Common Performance Issues in Spark
- Optimizing Data Operations in Spark
- Data Formats and Performance
- Caching and Garbage Collection in Spark
- Data Partitioning and Bucketing
- Delta Lake Optimizations
- Maintenance of Spark Applications
- Customizing Spark Session Configurations
Introduction to Spark Architecture and Ecosystem
- Overview of Spark architecture
- Key components: Driver, Executors, Cluster Manager
- The ecosystem: JVM, Kubernetes, YARN, HDFS, Hive Metastore
Understanding the Spark UI
- Structure of the Spark UI
- Functionality of different tabs (Jobs, Stages, Storage, Environment, Executors)
- Monitoring and diagnosing Spark applications
Common Performance Issues in Spark
- Shuffles and data skew
- Sorting
- Narrow and wide transformations
Optimizing Data Operations in Spark
- Join operations: broadcast joins, shuffle joins
- Union and merge operations
Data Formats and Performance
- Common data formats such as JSON, CSV, and Parquet
- Impact of data format on performance
- Making optimal use of data formats for Spark applications
Caching and Garbage Collection in Spark
- Caching mechanisms in Spark (cache(), persist())
- Data persistence
- Garbage collection settings and their impact on performance
Data Partitioning and Bucketing
- Partitioning strategies and impact in Spark
- Bucketing techniques and their benefits
Delta Lake Optimizations
- Introduction to Delta Lake
- Performance optimization in Delta Lake
- Delta Lake housekeeping
Maintenance of Spark Applications
- Regular maintenance tasks for Spark applications
- Monitoring and diagnostics tools
Customizing Spark Session Configurations
- Spark session configurations and their impact on performance
- Common Spark session parameters
- Customizing configurations for specific workloads
Schedule
Start date | Duration | Location |
---|---|---|
June 16, 2025 to June 17, 2025 | 2 days | Veenendaal / Remote (hybrid training; can be followed remotely) |
All courses can also be delivered within your organization as customized or in-company training.
Our training advisors are happy to give you personal advice or help you arrange in-company training within your organization.
Trainers
Prior knowledge courses
"Very pleasant teacher, gave a very good interpretation of the course in their own way. It was nice to follow the course like that." Marieke
- Highly rated
- Practice-oriented training
- Certified trainers
- In-house instructors