Spark Streaming Best Practices: A bare minimum checklist for Beginners and Advanced Users

Canadian Data Guy
4 min read · Oct 27, 2022

Most good things in life come with nuance. When I was learning streaming a few years ago, I spent hours searching for best practices, but the answers I found were too complicated for a beginner to make sense of. So I devised a set of best practices that should hold true in almost all scenarios.

The checklist below is not ordered; aim to check off as many items as you can.

Beginner's best practices checklist for Spark Streaming:

  • [ ] Choose a trigger interval rather than leaving it unset, because it helps control storage transaction (API/listing) costs. Some Spark jobs have a component that requires an S3/ADLS listing operation. With no trigger interval, a new micro-batch starts as soon as the previous one finishes, so if processing is very fast (think under 1 second), these operations repeat continuously and lead to unintended costs. Example: .trigger(processingTime='5 seconds')
  • If you are using Auto Loader, then switch to file notification mode: https://docs.databricks.com/ingestion/auto-loader/file-notification-mode.html
  • Do not enable versioning on the S3 bucket. Delta tables already have time travel to recover from failures, and versioning adds significant latency at scale.
  • Keep the compute and storage located in the same…
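
The first two checklist items can be sketched together in PySpark. This is a minimal sketch rather than a definitive setup: it assumes a Databricks runtime (the `cloudFiles` source is Auto Loader), and the bucket paths and the helper names `autoloader_options`, `batches_per_day`, and `start_stream` are hypothetical, made up for illustration. Notification mode also needs cloud permissions that are not shown here.

```python
# Hypothetical storage locations; replace with your own.
SOURCE_PATH = "s3://example-bucket/landing/"
CHECKPOINT_PATH = "s3://example-bucket/_checkpoints/events/"
TARGET_PATH = "s3://example-bucket/delta/events/"


def autoloader_options(schema_location, file_format="json"):
    """Auto Loader reader options with file notification mode switched on."""
    return {
        "cloudFiles.format": file_format,
        # Use file notifications instead of directory listing to find new files.
        "cloudFiles.useNotifications": "true",
        "cloudFiles.schemaLocation": schema_location,
    }


def batches_per_day(trigger_seconds):
    """Upper bound on micro-batches per day for a given trigger interval."""
    return int(24 * 60 * 60 / trigger_seconds)


def start_stream(spark, trigger="5 seconds"):
    """Start an Auto Loader stream into Delta with an explicit trigger interval.

    `spark` is an existing SparkSession on a Databricks cluster.
    """
    source = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(CHECKPOINT_PATH))
        .load(SOURCE_PATH)
    )
    return (
        source.writeStream.format("delta")
        .option("checkpointLocation", CHECKPOINT_PATH)
        # Explicit trigger interval: at most one micro-batch every 5 seconds.
        .trigger(processingTime=trigger)
        .start(TARGET_PATH)
    )
```

On a cluster you would call `start_stream(spark)`. As a rough sense of scale, `batches_per_day(5)` gives 17,280 micro-batches per day at most, against 86,400 or more when batches finish in under a second and no trigger interval is set.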

Canadian Data Guy

https://canadiandataguy.com | Data Engineering & Streaming @ Databricks | Ex Amazon/AWS | All Opinions Are My Own