Optimizing Databricks SQL: Achieving Blazing-Fast Query Speeds at Scale

Canadian Data Guy
3 min read · Sep 12

In today's data-driven age, delivering a seamless user experience is paramount. While there are numerous ways to measure that experience, one metric stands tall when evaluating the responsiveness of applications and databases: P99 latency. Especially vital for SQL queries, this seemingly esoteric number is, in reality, a powerful gauge of the experience we provide to our customers. Why is it so crucial? And how can we optimize it so that our databases aren't just fast, but consistently reliable for 99% of our users? Join us as we demystify P99 latency and delve into strategies for fine-tuning it in Databricks SQL.


What is P99 Latency?

The P99 latency (also known as the 99th percentile latency) for SQL queries is a metric used to measure the response time of SQL queries in a database system. It is the threshold below which 99% of queries complete: 99% of queries have a response time less than or equal to this value, and the remaining 1% take longer.

In other words, P99 latency helps you understand the worst-case response time for most of your SQL queries. It is often used to evaluate the performance of a database system and ensure that the vast majority of queries are responding quickly, even under heavy load.

For example, if the P99 latency for a particular SQL query is 100 milliseconds, it means that 99% of the time, that query will execute in 100 milliseconds or less. However, in 1% of cases, it may take longer than 100 milliseconds.
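
To make this concrete, here is a minimal sketch of how you might measure P99 latency yourself in Databricks SQL. The query_logs table, its columns, and the one-day window are illustrative assumptions for the example, not built-ins; substitute whatever source captures your query durations.

```sql
-- Hypothetical log table: query_logs(query_id STRING, started_at TIMESTAMP, duration_ms DOUBLE),
-- one row per executed query.
SELECT
  percentile(duration_ms, 0.99)        AS p99_latency_ms,        -- exact 99th percentile
  approx_percentile(duration_ms, 0.99) AS p99_latency_approx_ms  -- cheaper estimate for very large logs
FROM query_logs
WHERE started_at >= current_timestamp() - INTERVAL 1 DAY;
```

If this number creeps above your target (for example, the 5-second goal below), the steps that follow are where to start tuning.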

To achieve a P99 latency of 5 seconds in Databricks SQL, you can follow these steps:

  1. Run OPTIMIZE on the table hourly, applying a WHERE filter on the timestamp so that only recently written data is compacted (see the sketch after this list).
  2. Use at least a Medium instance of DBSQL. Larger instances provide better query performance when queries rely on caching. Caching is done at the instance level; if you have N small instances, you will have N copies of the cached data occupying memory.
  3. Set spark.databricks.delta.stalenessLimit to x minutes, per your business requirements. This parameter defines how many minutes stale the table state read by a query is allowed to be before Delta forces an update.
  4. Ensure that the columns used in the…
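
As a rough illustration of steps 1 and 3, the snippet below shows what the hourly OPTIMIZE and the staleness setting could look like. The table name events, the partition column event_date, and the 15-minute staleness window are assumptions made for the example; OPTIMIZE ... WHERE only accepts predicates on partition columns, so adjust the filter (and verify the config value format) against your own table layout and runtime.

```sql
-- Step 1 (sketch): compact only recently written partitions each hour,
-- rather than rewriting the whole table. `events` and `event_date` are
-- placeholder names; the WHERE predicate must reference partition columns.
OPTIMIZE events
WHERE event_date >= date_sub(current_date(), 1);

-- Step 3 (sketch): let queries read a snapshot up to 15 minutes old instead of
-- waiting for the latest table state; pick the window your business can tolerate.
SET spark.databricks.delta.stalenessLimit = 15m;
```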