Spark Streaming Best Practices: A Bare Minimum Checklist for Beginners and Advanced Users

Oct 27, 2022

Most good things in life come with a nuance. When I was learning Streaming a few years ago, I spent hours searching for best practices, but the answers I found were too complicated for a beginner's mind to make sense of. So I devised a set of best practices that should hold true in almost all scenarios.

The checklist below is not ordered; aim to check off as many items as you can.

Beginner's best practices checklist for Spark Streaming:

[ ] Choose a trigger interval rather than leaving it unset, because it helps control storage transaction API and listing costs. Some Spark jobs have a component that requires an S3/ADLS listing operation. If processing is very fast (think <1 sec), the job keeps repeating these operations, which leads to unintended costs. Example: .trigger(processingTime='5 seconds')
[ ] If you are using Auto Loader, switch to file notification mode: https://docs.databricks.com/ingestion/auto-loader/file-notification-mode.html
[ ] Do not enable versioning on the S3 bucket. Delta tables already have time travel to recover from failures, and versioning adds significant latency at scale.
[ ] Keep the compute and storage located in the same…
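To make the trigger-interval item concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every micro-batch issues one listing request and that listing requests cost $0.005 per 1,000; both numbers are assumptions, so check your own workload and your cloud provider's pricing. The helper names (batches_per_day, daily_listing_cost, start_with_trigger) are hypothetical and not part of Spark.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: price per 1,000 listing requests in USD. Check your provider's pricing.
ASSUMED_PRICE_PER_1000_LIST_CALLS = 0.005


def batches_per_day(trigger_interval_seconds: float) -> float:
    """Number of micro-batches started per day at a fixed trigger interval."""
    return SECONDS_PER_DAY / trigger_interval_seconds


def daily_listing_cost(
    trigger_interval_seconds: float,
    list_calls_per_batch: int = 1,  # assumption: one listing call per micro-batch
    price_per_1000: float = ASSUMED_PRICE_PER_1000_LIST_CALLS,
) -> float:
    """Estimated daily listing cost in USD for a single stream."""
    calls = batches_per_day(trigger_interval_seconds) * list_calls_per_batch
    return calls * price_per_1000 / 1000


def start_with_trigger(stream_df, checkpoint_path, target_path, interval="5 seconds"):
    """Start a Delta streaming write with an explicit processing-time trigger.

    stream_df is expected to be a streaming DataFrame; the paths are placeholders.
    """
    return (
        stream_df.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime=interval)
        .start(target_path)
    )


if __name__ == "__main__":
    # Compare a near-continuous stream with slower trigger intervals.
    for interval in (0.5, 1, 5, 60):
        print(
            f"interval={interval}s  batches/day={batches_per_day(interval):,.0f}  "
            f"est. cost/day=${daily_listing_cost(interval):.4f}"
        )
```

Under these assumptions the per-stream cost looks small, but it scales linearly with the number of streams and with the number of listing calls each micro-batch actually makes, which is why a modest trigger interval is a cheap safeguard.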


Written by Canadian Data Guy