course

NL/EN

Apache Spark for Data Engineers Masterclass

Name: Apache Spark for Data Engineers Masterclass
Price: 1610 EUR

Get a deeper understanding of Apache Spark in order to optimize your data workflow.

May 18, 2026

- Veenendaal / Remote

2 days

€ 1610 (excl. VAT)

Description

In this course, you will explore techniques and best practices for optimizing Apache Spark applications. As you study the architectural elements of Spark, you will learn to work with the Spark UI. You will identify and address common performance issues caused by shuffles and skew. Advanced optimization strategies for join, union, and merge operations, data formats, caching mechanisms, garbage collector settings, data partitioning, bucketing, and Delta Lake optimizations are also covered. Additionally, you will explore regular maintenance tasks for Spark applications and learn how to customize Spark session configurations for optimal performance.

Learning Goals

- Describe the architecture of a Spark application. (Remember)
- Explain the structure and functionality of the Spark UI. (Understand)
- Predict common performance issues caused by shuffling and data skew. (Apply)
- Optimize join, union, and merge operations in Spark. (Analyze)
- Change the data format for optimal performance. (Apply)
- Implement caching mechanisms and garbage collector settings for enhanced performance. (Apply)
- Use data partitioning and bucketing in Spark workloads. (Apply)
- Apply Delta Lake optimizations for better performance in Spark. (Apply)
- Describe regular maintenance tasks for Spark applications. (Understand)
- Customize Spark session configurations for optimal performance. (Apply)

The learning goals above are classified using Bloom's Taxonomy.

Prior Knowledge

Python
Apache Spark fundamentals

Subjects

Introduction to Spark Architecture and Ecosystem
Understanding the Spark UI
Common Performance Issues in Spark
Optimizing Data Operations in Spark
Data Formats and Performance
Caching and Garbage Collection in Spark
Data Partitioning and Bucketing
Delta Lake Optimizations
Maintenance of Spark Applications
Customizing Spark Session Configurations

Introduction to Spark Architecture and Ecosystem

Overview of Spark architecture
Key components: Driver, Executors, Cluster Manager
The ecosystem: JVM, Kubernetes, YARN, HDFS, Hive Metastore

Understanding the Spark UI

Structure of the Spark UI
Functionality of different tabs (Jobs, Stages, Storage, Environment, Executors)
Monitoring and diagnosing Spark applications

Common Performance Issues in Spark

Shuffles and Data Skew
Sorting
Narrow and Wide transformations
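
To make the skew topic concrete, here is a minimal sketch in plain Python (no Spark required) of why one hot key overloads a single partition after a shuffle, and how key salting spreads it out. The partition count, the salt count, the sample keys, and the CRC32-based partitioner are all assumptions made for illustration; Spark uses its own hash function and partitioner.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count for this sketch
NUM_SALTS = 8       # hypothetical number of salt values per key


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic stand-in for a hash partitioner (not Spark's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys) -> Counter:
    """Count how many records land in each partition."""
    return Counter(partition_for(k) for k in keys)


def salt(key: str, rng: random.Random, num_salts: int = NUM_SALTS) -> str:
    """Append a random suffix so a hot key maps to several partitions."""
    return f"{key}#{rng.randrange(num_salts)}"


# Hypothetical skewed data: one customer accounts for 90% of the records.
keys = ["customer_1"] * 9000 + [f"customer_{i}" for i in range(2, 1002)]

rng = random.Random(42)  # fixed seed so the run is repeatable
skewed = partition_sizes(keys)
salted = partition_sizes(salt(k, rng) for k in keys)

print("largest partition before salting:", max(skewed.values()))
print("largest partition after salting: ", max(salted.values()))
```

In a real job the salt has to be undone afterwards: an aggregation is typically done in two stages (per salted key, then per original key), and for a join the other side has to be replicated once per salt value.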

Optimizing Data Operations in Spark

Join operations: broadcast joins, shuffle joins
Union and merge operations

Data Formats and Performance

Common data formats such as JSON, CSV, and Parquet
Impact of data format on performance
Making optimal use of data formats for Spark applications

Caching and Garbage Collection in Spark

Caching mechanisms in Spark (cache(), persist())
Data persistence
Garbage collection settings and their impact on performance

Data Partitioning and Bucketing

Partitioning strategies and impact in Spark
Bucketing techniques and their benefits
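
The benefit of bucketing can be sketched in plain Python: when two tables are bucketed on the join key with the same bucket count, matching keys are guaranteed to sit in the same bucket number, so each bucket pair can be joined independently without redistributing rows. The bucket count, the sample rows, and the CRC32-based bucket function below are assumptions made for illustration; Spark uses its own hash function.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count; must be identical for both tables


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Stand-in bucket function (Spark uses its own hash, not CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows):
    """Group (key, value) rows into buckets by the hash of the key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[0])].append(row)
    return buckets


def bucketed_join(left, right):
    """Inner join that only compares bucket i of left with bucket i of right."""
    left_b, right_b = bucketize(left), bucketize(right)
    result = []
    for b in range(NUM_BUCKETS):
        # Build a lookup for this bucket of the right table only.
        lookup = defaultdict(list)
        for key, value in right_b.get(b, []):
            lookup[key].append(value)
        # Probe it with the same bucket of the left table.
        for key, value in left_b.get(b, []):
            for other in lookup.get(key, []):
                result.append((key, value, other))
    return result


# Hypothetical sample data.
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
customers = [(1, "Alice"), (2, "Bob"), (4, "Dana")]

joined = sorted(bucketed_join(orders, customers))
print(joined)
```

The result is the same as an ordinary inner join on the key; the point of the sketch is that no row ever has to be compared with a row from a different bucket.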

Delta Lake Optimizations

Introduction to Delta Lake
Performance optimization in Delta Lake
Delta Lake housekeeping

Maintenance of Spark Applications

Regular maintenance tasks for Spark applications
Monitoring and diagnostics tools

Customizing Spark Session Configurations

Spark session configurations and their impact on performance
Common Spark session parameters
Customizing configurations for specific workloads
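
As an illustration of what a workload-specific configuration might look like, the sketch below collects a few well-known Spark properties in a Python dictionary and renders them as `--conf` arguments for `spark-submit`. The property names are real Spark settings, but every value is a hypothetical starting point chosen for this example, not a recommendation from the course; the helper `to_submit_args` is likewise made up for illustration.

```python
# Hypothetical starting values for a shuffle-heavy batch job.
# The right values depend on data volume and cluster size.
session_conf = {
    # Number of partitions used for shuffles in joins and aggregations.
    "spark.sql.shuffle.partitions": "400",
    # Adaptive Query Execution and its skew-join handling (Spark 3.x).
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Tables below this size in bytes may be broadcast instead of shuffled.
    "spark.sql.autoBroadcastJoinThreshold": str(50 * 1024 * 1024),
    # Executor sizing.
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
}


def to_submit_args(conf: dict) -> list:
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(session_conf)
print(" ".join(submit_args))

# With PySpark installed, the same pairs could instead be passed one by one
# to SparkSession.builder.config(key, value) before calling getOrCreate().
```

Keeping the settings in one dictionary makes it easy to maintain a separate profile per workload and to compare the profile against what the Environment tab of the Spark UI reports.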

Start date	End date	Duration	Location
May 18, 2026	May 19, 2026	2 days	Veenendaal / Remote (hybrid training; can be followed remotely)

All courses can also be conducted within your organization as customized or in-company training.

Our training advisors are happy to give you personal advice or help you arrange in-company training for your organization.

Prior knowledge courses

course - ASF (NL/EN)

Apache Spark Fundamentals

Get started processing data with Apache Spark and PySpark

2 days
€ 1610
Classroom
May 11, 2026

Databases
Data Engineering
AI-Powered Applications

course - PYTHONDEV (NL/EN). Class is guaranteed to run.

Python Fundamentals

Attain a solid foundation of Python for developing software solutions

3 days
€ 2175
Classroom
November 12, 2025

Python

"Extremely good teacher"

Sander

Highly rated
Practice-oriented training
Certified trainers
In-house instructors

Trainers

Douwe van den Berg

Josquin Booij
