The key question to answer: given the Flink job definition of a real-time feature, the historical Kafka data, and an exact point in time in the past, can we confidently say what the feature value in Redis would have been at that moment?
In this talk, I will demonstrate that it is impossible to answer this question accurately. Approaches such as Flink's batch mode cannot accurately handle late events, nor can they tell us when, in processing time, a window closed. In fact, they practically guarantee feature leakage.
The key reason behind the challenge is that historical Kafka records contain only Kafka producer timestamps (event time), not the processing time at which each record was originally consumed. This makes it extremely difficult to say with certainty which records would have made it into a particular window, and when exactly, in processing time, an event-time window would have fired based on watermarks.
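To make the problem concrete, here is a minimal sketch of the kind of event-time feature job in question (the Event class, field names, window size, and lateness bound are hypothetical illustrations, not taken from the talk):

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FeatureJobSketch {

    /** Hypothetical click event; the timestamp is the Kafka producer timestamp (event time). */
    public static class Event {
        public String userId;
        public long timestamp; // event time, millis
        public double amount;

        public Event() {}

        public Event(String userId, long timestamp, double amount) {
            this.userId = userId;
            this.timestamp = timestamp;
            this.amount = amount;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                new Event("u1", 1_000L, 1.0),
                new Event("u1", 58_000L, 2.0),
                new Event("u1", 30_000L, 4.0)) // out-of-order, potentially late record
            // Event time is read from the record itself; watermarks, however, advance as
            // records arrive in *processing* time, which Kafka does not retain.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, ts) -> event.timestamp))
            .keyBy(event -> event.userId)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            // Whether the out-of-order record is included in the window result, triggers a
            // late re-firing, or is dropped depends on when the watermark passed the window
            // end in processing time; that information is lost once the job has run.
            .allowedLateness(Time.seconds(30))
            .sum("amount")
            .print();

        env.execute("feature-job-sketch");
    }
}
```

Replaying such a pipeline over historical Kafka data reproduces the event timestamps, but not the processing-time order in which the original job saw the records, so the late-record and window-firing behavior is not reproducible.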
I will then discuss assumptions we can make to get "good enough" backfill results while trying to avoid feature leakage. The best approach likely depends on your particular Kafka/Flink setup.