How to write your first Spark application with Stream-Stream Joins with working code

Canadian Data Guy
11 min read · Mar 23, 2023

Source: https://canadiandataguy.com/blog/spark-stream-stream-join/

Have you been wanting to try Streaming but have not been able to take the plunge?

In this single blog post, we cover what you need to understand about Streaming Joins and give you working code that you can reuse in your next Streaming Pipeline.

The steps involved:

  1. Create a fake dataset at scale
  2. Set a baseline using traditional SQL
  3. Define Temporary Streaming Views
  4. Inner Joins with optional Watermarking
  5. Left Joins with Watermarking
  6. The cold start edge case: withEventTimeOrder
  7. Cleanup

Index

· What is Stream-Stream Join?
Concept of Stream-Stream Join
Types of Stream-Stream Join
· 1. The Setup: Create a fake dataset at scale
Next, we will break this Delta table into 2 different tables
· 2. Set a baseline using traditional SQL
· Summary so far:
· 3. Define Temporary Streaming Views
· 4. Inner Joins with optional Watermarking
How was the watermark computed in this scenario?
· 5. Left Joins with Watermarking
5.a How Left Joins works differently than an Inner Join
5.b What to observe:
· 6. The cold start edge case: withEventTimeOrder
· 7. Cleanup
· Download the code
References:
· Footnote:

What is Stream-Stream Join?

A stream-stream join is a widely used stream-processing operation in which two or more data streams are joined on common attributes or keys. It underpins use cases such as real-time analytics, fraud detection, and IoT data processing.
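To make the idea concrete before touching Spark, here is a minimal pure-Python sketch of joining two streams on a key. It is a toy illustration, not Spark's implementation: the class name `ToyStreamStreamJoin`, the `on_left`/`on_right` methods, and the impression/click sample events are all hypothetical names chosen for this example. The point it tries to show is that either side can arrive first, so each side has to buffer its rows until a match shows up on the other side.

```python
from collections import defaultdict


class ToyStreamStreamJoin:
    """Toy inner join of two streams on a key (illustration only).

    Each side buffers every row it has seen, so a row arriving on one
    side can still match rows that arrived earlier on the other side.
    """

    def __init__(self):
        self.left_state = defaultdict(list)   # key -> buffered left values
        self.right_state = defaultdict(list)  # key -> buffered right values

    def on_left(self, key, value):
        # Buffer the new left row, then emit a match for every
        # right row already buffered under the same key.
        self.left_state[key].append(value)
        return [(key, value, r) for r in self.right_state[key]]

    def on_right(self, key, value):
        # Mirror image of on_left.
        self.right_state[key].append(value)
        return [(key, l, value) for l in self.left_state[key]]


join = ToyStreamStreamJoin()
matches = []
matches += join.on_left("ad_1", "impression@10:00")  # no click yet: nothing emitted
matches += join.on_right("ad_1", "click@10:02")      # matches the buffered impression
matches += join.on_right("ad_2", "click@10:03")      # click arrives before its impression
matches += join.on_left("ad_2", "impression@10:05")  # matches the buffered click
print(matches)
```

Note that nothing in this sketch ever removes rows from the buffers, so its state grows forever. Bounding that buffered state is the problem watermarking addresses in the join steps later in this post.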

Concept of Stream-Stream Join


Canadian Data Guy

https://canadiandataguy.com | Data Engineering & Streaming @ Databricks | Ex Amazon/AWS | All Opinions Are My Own