Synthetic Data Made Simple: Generating and Streaming Custom-Sized Data to Kafka

Canadian Data Guy
5 min read · Jun 16, 2024

Introduction

In the fast-paced world of data engineering, there often arises a need to generate large volumes of synthetic data for testing and benchmarking. Recently, I was tasked with a crucial project: creating records of a specific size (1 MB each) and streaming them to Kafka for performance benchmarking. This blog post, the first in a two-part series, walks you through generating such data with Python and Apache Spark and then streaming it to Kafka efficiently. Tomorrow, in Part 2, we'll benchmark Kafka-to-Delta ingestion speed on Databricks Jobs and Delta Live Tables.
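To make the end goal concrete before getting into the story, here is a minimal sketch of the kind of Spark job this series builds toward: generate rows carrying a roughly 1 MB payload and publish them to a Kafka topic in a single batch write. Treat it as a starting point under stated assumptions rather than the exact code used later in the series; the broker address (`localhost:9092`), topic name (`synthetic_records`), connector version, and record count are all placeholders you would adjust for your environment.

```python
# Minimal sketch (assumptions noted above): build ~1 MB records in Spark
# and push them to Kafka with a single batch write.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("synthetic-kafka-feed")
    # The Kafka sink ships as a separate package; match the version to your Spark build.
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1")
    .getOrCreate()
)

RECORD_SIZE_BYTES = 1 * 1024 * 1024   # target ~1 MB per record
NUM_RECORDS = 1_000                   # illustrative volume

df = (
    spark.range(NUM_RECORDS)                           # one row per record id
    .withColumn("key", F.col("id").cast("string"))
    # repeat() builds a fixed-length ASCII string, so each value is ~1 MB on the wire
    .withColumn("value", F.repeat(F.lit("x"), RECORD_SIZE_BYTES))
)

(
    df.selectExpr("key", "value")                      # the Kafka sink expects key/value columns
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("topic", "synthetic_records")                  # placeholder topic
    .save()
)
```

One practical note: Kafka's default maximum message size is roughly 1 MB, so producing records of this size usually means raising `message.max.bytes` on the broker (or topic) and `max.request.size` on the producer side.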

Before diving deeper, let me share the story behind this endeavor.

The Challenge: Preparing for Technology Decisions

Imagine you’re part of a data engineering team at a rapidly growing tech startup. Your CTO has tasked you with benchmarking the expected speed of Kafka to Delta ingestion before making critical technology decisions. You quickly realize two things:

  1. No suitable public Kafka feed: You need a Kafka feed that matches your specific requirements, especially in terms of record size.
  2. Complex setup with AWS MSK: Setting up AWS Managed…

