Streaming Any File Type with Autoloader in Databricks: A Working Guide

Canadian Data Guy
4 min read · Jan 4, 2024

Spark Streaming has become a leading streaming framework, known for its scalable, high-throughput, and fault-tolerant handling of live data streams. Spark Streaming and Databricks Autoloader natively support standard file formats such as JSON, CSV, PARQUET, AVRO, TEXT, BINARYFILE, and ORC, but they can be used well beyond these. This blog post shows how to use Spark Streaming and Databricks Autoloader to process file types that are not natively supported.

The Process Flow:

  1. File Detection with Autoloader: Autoloader identifies new files, an essential step for real-time data processing. It ensures every new file is detected and queued for processing, providing the actual file path for reading.
  2. Custom UDF for File Parsing: We develop a custom User-Defined Function (UDF) to manage unsupported file types. This UDF is crafted specifically for reading and processing the designated file format.
  3. Data Processing and Row Formation: Within the UDF, we process the file content, transforming it into structured data, usually in row format.
  4. Writing Back to Delta Table: We then write the processed data back to a Delta table for further use.
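The four steps above can be sketched in PySpark roughly as follows. This is a minimal sketch under several assumptions, not a definitive implementation: the files sit on DBFS (so a `dbfs:/...` path can be opened locally under `/dbfs`), Autoloader runs with `cloudFiles.format` set to `binaryFile` purely to detect files, and the file format is a made-up one (length-prefixed UTF-8 records) standing in for a real unsupported type. The names `to_local_path`, `parse_records`, `parse_file`, and `start_stream` are hypothetical.

```python
import struct
from typing import List, Tuple


def to_local_path(uri: str) -> str:
    """Turn the path reported by Autoloader into one that open() accepts.

    Assumption: files sit on DBFS, which Databricks also exposes under /dbfs.
    """
    for scheme, prefix in (("dbfs:", "/dbfs/"), ("file:", "/")):
        if uri.startswith(scheme):
            return prefix + uri[len(scheme):].lstrip("/")
    return uri


def parse_records(data: bytes) -> List[Tuple[int, str]]:
    """Hypothetical format: repeated [4-byte big-endian length][UTF-8 payload]."""
    rows, offset = [], 0
    while offset + 4 <= len(data):
        (size,) = struct.unpack_from(">I", data, offset)
        payload = data[offset + 4 : offset + 4 + size]
        rows.append((len(rows), payload.decode("utf-8")))
        offset += 4 + size
    return rows


def parse_file(uri: str) -> List[Tuple[int, str]]:
    """Steps 2 and 3: open the detected file and return its records as rows."""
    with open(to_local_path(uri), "rb") as fh:
        return parse_records(fh.read())


def start_stream(spark, source_dir: str, checkpoint_dir: str, target_table: str):
    """Steps 1 and 4. Needs Databricks, where the cloudFiles source is available."""
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    row_type = T.ArrayType(
        T.StructType([
            T.StructField("record_index", T.IntegerType()),
            T.StructField("payload", T.StringType()),
        ])
    )
    parse_udf = F.udf(parse_file, row_type)

    detected = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")  # used here only to detect files
        .load(source_dir)
        .select("path")  # keep only the path; the UDF opens the file itself
    )
    rows = (
        detected.withColumn("row", F.explode(parse_udf("path")))
        .select("path", "row.*")  # one output row per parsed record
    )
    return (
        rows.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_dir)
        .trigger(availableNow=True)  # process everything pending, then stop
        .toTable(target_table)
    )
```

Keeping the parser a plain Python function, separate from the Spark wiring, means it can be unit-tested on sample bytes before it is wrapped in a UDF. If the files live on object storage that cannot be opened as a local path, the UDF would need to read them some other way, for example from the `content` column that `binaryFile` provides.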

This post uses ROS Bag files as its example, but the same method can be adapted to any other file type.
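As a rough idea of what the ROS Bag parsing function could look like, here is a minimal sketch. It assumes the ROS 1 `rosbag` Python package is installed on the cluster and that each message is stored as its string form; the names `messages_to_rows` and `read_bag` and the three-column row layout are illustrative choices, not taken from the original example.

```python
from typing import Any, Iterable, List, Tuple

# (topic, timestamp in nanoseconds, message rendered as text)
Row = Tuple[str, int, str]


def messages_to_rows(messages: Iterable[Tuple[str, Any, Any]]) -> List[Row]:
    """Flatten (topic, msg, t) triples into plain tuples that Spark can serialize."""
    return [(topic, int(t.to_nsec()), str(msg)) for topic, msg, t in messages]


def read_bag(local_path: str) -> List[Row]:
    """Read every message in a bag file (assumes the ROS 1 `rosbag` package)."""
    import rosbag  # imported lazily so this module loads even without ROS

    with rosbag.Bag(local_path) as bag:
        return messages_to_rows(bag.read_messages())
```

To adapt the approach to another file type, only the reader function needs to change, as long as it still returns a list of tuples matching the schema declared for the UDF.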

https://canadiandataguy.com | Data Engineering & Streaming @ Databricks | Ex Amazon/AWS | All Opinions Are My Own