Conquering All Your Stream Processing Needs with Kafka and Spark

Kafka Summit 2016 | Systems Track


Tathagata Das, Software Engineer, Databricks


Apache Spark, specifically Spark Streaming, is becoming one of the most widely used stream processing systems for Kafka. At its heart, Spark is an extremely fast, general-purpose distributed data processing platform. This generality lets a single framework unify all kinds of data processing: streaming, SQL, and machine learning. For Kafka users, this means that they can use Spark to run batch jobs, streaming pipelines, and interactive queries on Kafka data. In this talk, I am going to give a brief overview of the Spark framework and elaborate on how different components of Spark can be used to process data from Kafka. Specifically, I am going to cover the following:

  • Real-time processing of Kafka streams with Spark Streaming
  • Batch and interactive querying of Kafka data with Spark and Spark SQL
  • Schema-aware streaming ETL with Streaming DataFrames