Spark Streaming Best Practices: A bare-minimum checklist for beginners and advanced users
Most good things in life come with a nuance. While learning Streaming a few years ago, I spent hours searching for best practices, but the answers I found were often too complicated for a beginner to make sense of. So I devised a set of best practices that should hold true in almost all scenarios.
The checklist below is not ordered; aim to check off as many items as you can.
Beginner's best-practices checklist for Spark Streaming:
- [ ] Choose an explicit trigger interval over the default, because it helps control storage-transaction/listing API costs. Some Spark jobs include a component that performs an S3/ADLS listing operation; if our processing is very fast (think under 1 second), we will keep repeating these operations and incur unintended costs. Example: `.trigger(processingTime='5 seconds')`
- [ ] If you are using Auto Loader, switch to file notification mode: https://docs.databricks.com/ingestion/auto-loader/file-notification-mode.html
- [ ] Do not enable versioning on the S3 bucket; versioning adds significant latency at scale, and Delta tables already have time travel to recover from failures.
- [ ] Keep the compute and storage in the same region.
- [ ] Use ADLS Gen2 on Azure over Blob Storage, as it's better suited for big data analytics workloads. Read more on the differences here.
- [ ] Choose the table partition strategy carefully, using low-cardinality columns like date, region, country, etc. A rough rule of thumb: if you have more than 100,000 partitions, you have over-partitioned your table. Date columns make good partition columns because they occur naturally. For example, a multinational e-commerce company that operates in 20 countries and wants to store 10 years of data, partitioned by date and country, will end up with (365 * 10) * 20 = 73,000 partitions.
- [ ] Name your streaming query so it is easily identifiable in the Spark UI Streaming tab.
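The trigger-interval item above can be sketched as follows. This is a minimal sketch, not a full pipeline: `df` is assumed to be an existing streaming DataFrame, and the sink format, checkpoint location, and output path are placeholders.

```python
def write_with_trigger(df, checkpoint_path, output_path):
    """Start `df` (assumed to be a streaming DataFrame) with a 5-second
    processing-time trigger instead of the default run-as-soon-as-possible
    micro-batching, so listing operations fire at most every 5 seconds."""
    return (
        df.writeStream
          .format("delta")                                # placeholder sink
          .option("checkpointLocation", checkpoint_path)  # placeholder path
          .trigger(processingTime="5 seconds")            # cap the batch rate
          .start(output_path)
    )
```

With the trigger set, a near-empty source no longer burns listing requests in a tight loop; each micro-batch waits out the interval.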
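For the Auto Loader item, `cloudFiles.useNotifications` is the option that switches from directory listing to file notification mode (the cloud-side queue setup is covered in the linked docs). The source format and path below are assumptions for illustration.

```python
def read_with_notifications(spark, source_path):
    """Read files with Auto Loader in file-notification mode rather than
    repeatedly listing the source directory."""
    return (
        spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")            # assumed source format
             .option("cloudFiles.useNotifications", "true")  # notification mode
             .load(source_path)
    )
```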
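The reason bucket versioning is unnecessary for Delta tables is that recovery is a one-line time-travel operation. A minimal sketch using Delta Lake's `RESTORE` SQL command (table name and version are placeholders):

```python
def restore_delta_table(spark, table_name, version):
    """Roll a Delta table back to an earlier version via time travel,
    instead of relying on S3 bucket versioning for recovery."""
    spark.sql(f"RESTORE TABLE {table_name} TO VERSION AS OF {version}")
```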
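The partition-count arithmetic from the e-commerce example can be checked directly; the function below is a hypothetical helper for estimating a date-by-country layout:

```python
def estimated_partitions(years: int, countries: int) -> int:
    """Hive-style partition count for a table partitioned by date and country."""
    return 365 * years * countries

# 10 years of data across 20 countries:
print(estimated_partitions(10, 20))  # 73000, comfortably under the ~100,000 ceiling
```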
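Naming a query is a single call on the writer. A minimal sketch, assuming a streaming DataFrame `df`; the memory sink and the query name are illustrative choices only:

```python
def start_named_query(df):
    """Start a stream with an explicit name. The name appears in the Spark
    UI's Structured Streaming tab, and for the memory sink it also becomes
    the name of the queryable in-memory table."""
    return (
        df.writeStream
          .format("memory")           # debug-friendly sink; any sink works
          .queryName("events_debug")  # hypothetical query name
          .start()
    )
```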