How to upgrade your Spark Streaming application with a new checkpoint, with working code

Canadian Data Guy
3 min read · Jan 25, 2023

Index:

· Kafka Basics: Topics, partitions & offsets
· What information is inside the checkpoint?
· How to fetch information about Offset & Partition from the Checkpoint folder?
· Now the easy part: Use Spark to start reading Kafka from a particular Offset
· Footnote

Sometimes in life, we need to make breaking changes that require us to create a new checkpoint. Some example scenarios:

  1. You are making a code/application change that alters the processing logic
  2. A major Spark version upgrade, e.g., from Spark 2.x to Spark 3.x
  3. The previous deployment was wrong, and you want to reprocess from a certain point

There are plenty of scenarios in which you want to control precisely which data (Kafka offsets) gets processed.

Not every scenario requires a new checkpoint; here is a list of things you can change without one.

This blog helps you understand how to handle a scenario where a new checkpoint is unavoidable.


Kafka Basics: Topics, partitions & offsets

A Kafka cluster has topics: topics are a way to organize messages. Each topic has a name that is unique across the entire Kafka cluster. Messages are sent to and read from specific topics; in other words, producers write data to a topic, and consumers read data from it.

Topics have partitions, and data/messages are distributed across partitions. Every message belongs to exactly one partition.

Each partition holds messages, and every message has a unique, sequential identifier within its partition called the offset.
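
To make this concrete, below is the JSON shape in which Spark's Kafka source expresses per-partition offsets, both inside the checkpoint and in the startingOffsets option we will use later. The topic name and offset numbers are hypothetical, for illustration only.

    import json

    # Hypothetical topic "transactions" with three partitions (0, 1, 2).
    # Spark's Kafka source tracks progress as topic -> partition -> offset.
    offsets = {"transactions": {"0": 1042, "1": 998, "2": 1105}}

    # Serialized, this is exactly the string you would pass to startingOffsets:
    print(json.dumps(offsets))
    # {"transactions": {"0": 1042, "1": 998, "2": 1105}}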

What is the takeaway here?

We must identify, for each partition, which offset has already been processed, and this information can be found inside the checkpoint.
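
Here is a minimal sketch of pulling the latest offsets out of a Structured Streaming checkpoint. It assumes the checkpoint sits on a locally readable path (/tmp/old_checkpoint is a placeholder) and that the query has a single Kafka source; adapt the file access if your checkpoint lives on DBFS, S3, or ADLS.

    import json
    import os

    # Placeholder path -- point this at your existing checkpoint directory.
    checkpoint_dir = "/tmp/old_checkpoint"
    offsets_dir = os.path.join(checkpoint_dir, "offsets")

    # Offset log files are named after the micro-batch id; the highest id is the latest.
    latest_batch = max(int(name) for name in os.listdir(offsets_dir) if name.isdigit())

    with open(os.path.join(offsets_dir, str(latest_batch))) as f:
        lines = f.read().splitlines()

    # Line 0 is a version marker ("v1"), line 1 is batch metadata (watermark, configs),
    # and each following line holds one source's offsets. For a Kafka source, that is
    # JSON of the form {"topic": {"partition": offset}}.
    kafka_offsets = json.loads(lines[2])
    print(f"Batch {latest_batch} offsets: {kafka_offsets}")

One caveat: the newest entry in offsets/ may belong to a batch that was planned but never finished. Compare it against the commits/ directory and, if that batch was not committed, step back one file so you do not skip unprocessed data.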

What information is inside the checkpoint?
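
A Structured Streaming checkpoint directory typically contains a metadata file with the unique query id, an offsets/ directory holding one write-ahead log entry per micro-batch (recording how far each source had read), a commits/ directory marking which batches completed, and, depending on the query, sources/ and state/ directories. For our purpose, offsets/ is the part that matters: it holds exactly the topic/partition/offset map we extracted above.

Now the easy part: Use Spark to start reading Kafka from a particular Offset

With the offsets in hand, starting a fresh query from that exact position is a one-option change. Below is a minimal sketch; the broker address, topic name, offsets, and paths are placeholders, and the spark-sql-kafka package must be on the classpath. Note that startingOffsets is honored only when the new checkpoint location is empty; once the new checkpoint exists, it takes precedence.

    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("restart-from-offsets").getOrCreate()

    # Offsets recovered from the old checkpoint (same shape as shown earlier).
    starting_offsets = json.dumps({"transactions": {"0": 1042, "1": 998, "2": 1105}})

    df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
        .option("subscribe", "transactions")                # placeholder topic
        .option("startingOffsets", starting_offsets)        # resume exactly here
        .load()
    )

    # Write with a NEW checkpoint location so the query starts fresh.
    query = (
        df.writeStream.format("parquet")
        .option("checkpointLocation", "/tmp/new_checkpoint")  # placeholder path
        .option("path", "/tmp/output")                        # placeholder path
        .start()
    )

Because the checkpoint records the end of the last planned batch, i.e., the next offset to read, resuming with those values continues exactly where the old query left off.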


Canadian Data Guy

https://canadiandataguy.com | Data Engineering & Streaming @ Databricks | Ex Amazon/AWS | All Opinions Are My Own