SQL is the lingua franca of data analysis, but should we use it more as data engineers?
Modern tools like dbt make it easier to express transformations in SQL, but streaming is more complicated than batch. Streaming pipelines usually come with stricter SLAs and require many CI/CD and observability practices, so data engineers prefer familiar languages like Python, Java and Scala, along with their many useful frameworks and libraries. Can SQL replace that?
A few years ago, when I first heard the idea of using SQL to write somewhat complex stream-processing data applications, I was very skeptical. How do you unit test it? How do you version it?
Over the years, Spark SQL streaming, Flink SQL, ksqlDB and similar tools have matured, and they now support complex stateful transformations. However, the developer experience is still questionable: it’s easy to write a SQL statement, but how do you maintain it over the years as a long-running application?
In this presentation, I hope to share the discoveries I made over the years in this area, as well as working practices and patterns I’ve seen.