Conquering All Your Stream Processing Needs with Kafka and Spark

Kafka Summit 2016 | Systems Track

Speaker

Tathagata Das, Software Engineer, Databricks

Description: 

Apache Spark, specifically Spark Streaming, is becoming one of the most widely used stream processing systems for Kafka. At its heart, Spark is an extremely fast, general-purpose distributed data processing platform. This makes it possible to unify all kinds of data processing (streaming, SQL, and machine learning) in a single framework. For Kafka users, this means they can use Spark to run batch jobs, streaming pipelines, and interactive queries on Kafka data. In this talk, I am going to give a brief overview of the Spark framework and elaborate on how the different components of Spark can be used to process data from Kafka. Specifically, I am going to cover the following:

  • Real-time processing of Kafka streams with Spark Streaming
  • Batch and interactive querying of Kafka data with Spark and Spark SQL
  • Schema-aware streaming ETL with Streaming DataFrames