Are you considering converting your daily batch ETLs into a new and exhilarating real-time framework? We'll help you look before you leap as we take a deep dive into the unique operational challenges involved in switching data processing paradigms.
Because batch data pipelines consume data from well-defined time intervals and write results to partitioned data storage, batch jobs are often idempotent, and failure recovery is as simple as rerunning the failed job instances. Batch processes are triggered at a fixed frequency (e.g. daily or hourly), so data latency is determined by both the job schedule and the job run time. That latency is too high for many advanced data use cases, such as frequency capping, which require event streaming to deliver real-time data insights. Event streaming applications process unbounded input data in real time and append output to message queues and/or tables for further processing. However, real-time data insights are no free lunch: event streaming comes with many unique engineering challenges, such as handling late-arriving and duplicate events, implementing event-time partitioning, and backfilling historical data after failures. In addition, batch processing and event streaming are not incompatible with each other and are often better together, as the Delta and Kappa Architectures are commonly adopted in modern data systems.
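As a rough illustration of the contrast described above, the following minimal sketch uses in-memory dictionaries as stand-ins for partitioned storage. Every name in it (`EVENTS`, `run_batch_job`, `run_streaming_append`, the event tuple layout) is hypothetical and chosen only for this example; it is not the approach used at Netflix or in any particular framework. It shows why rerunning a batch job that overwrites its partition is safe, while an append-only streaming writer needs explicit deduplication and event-time partitioning to cope with redelivered and late events.

```python
from collections import defaultdict

# Hypothetical events: (event_id, event_date, user_id)
EVENTS = [
    ("e1", "2024-01-01", "u1"),
    ("e2", "2024-01-01", "u2"),
    ("e3", "2024-01-02", "u1"),
]


def run_batch_job(events, partition_date, table):
    """Recompute one date partition from scratch and overwrite it.

    Overwriting (rather than appending) is what makes a rerun idempotent.
    """
    rows = [e for e in events if e[1] == partition_date]
    table[partition_date] = rows
    return table


def run_streaming_append(stream, table, seen_ids):
    """Append events as they arrive, skipping event ids seen before.

    Each event is written to the partition of its own event date,
    so a late event still lands in the correct partition.
    """
    for event in stream:
        event_id, event_date, _ = event
        if event_id in seen_ids:
            continue  # duplicate delivery, drop it
        seen_ids.add(event_id)
        table[event_date].append(event)
    return table


# Batch: running the same job twice leaves the partition unchanged.
batch_table = {}
run_batch_job(EVENTS, "2024-01-01", batch_table)
run_batch_job(EVENTS, "2024-01-01", batch_table)  # rerun after a failure
batch_row_count = len(batch_table["2024-01-01"])

# Streaming: e1 is delivered twice, and e2 arrives after e3 (late).
stream_table = defaultdict(list)
seen = set()
stream = [EVENTS[0], EVENTS[2], EVENTS[0], EVENTS[1]]
run_streaming_append(stream, stream_table, seen)
stream_row_counts = {d: len(rows) for d, rows in stream_table.items()}
```

In a real system the set of seen ids cannot grow without bound, so deduplication state is usually limited to a time window; that detail is left out of this sketch.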
In this session, we will demystify the operational complexity of event streaming in the real data engineering world and share best practices learned from developing and maintaining web-scale data systems at Netflix. After attending the session, you will have a comprehensive understanding of the trade-offs between batch data processing and event streaming and be able to make better data system design decisions for your business/research use cases.