By nature, event-driven systems transform data and propagate it across multiple services, which makes GDPR compliance challenging. Kafka's immutable logs make it impossible to explicitly delete a published message that may contain Personally Identifiable Information (PII). A common workaround is to choose a short enough retention period for such topics that the data is eventually removed within the allowed time limit. On the consumer side, one typically has to audit and trace where the data propagates and ask each consuming service to purge its copy. Even then, PII may persist, for example in backups, intermediate staging environments such as S3 buckets, and ad-hoc copies of the data used for business analytics, data science, and so on.
This talk presents a way to build GDPR compliance into the message propagation protocol itself, using crypto-shredding to render all copies of PII indecipherable on demand. It explains how a message schema such as Protocol Buffers can be extended so that publishers can mark fields as PII, and how GDPR compliance can be retrofitted onto existing APIs that were not designed with it in mind, with minimal disruption. It illustrates how marked data is encrypted before it is stored in Kafka and remains encrypted throughout its entire propagation journey. Finally, the talk shows how the key management system works transparently across thousands of services to control access to data at different granularities and to protect against cross-referencing that could lead to unauthorized access.
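The core idea behind crypto-shredding can be sketched in a few lines: encrypt each subject's PII with a per-subject key, and when that subject exercises their right to erasure, delete the key, which renders every copy of the ciphertext (in Kafka logs, backups, downstream stores) permanently unreadable. The sketch below is illustrative only and is not from the talk: the `KeyStore` class, function names, and the toy SHA-256-based stream cipher are all hypothetical stand-ins (a real implementation would use an AEAD cipher such as AES-GCM and a proper key management service).

```python
import hashlib
import secrets

class KeyStore:
    """Hypothetical key store: one encryption key per data subject."""
    def __init__(self):
        self._keys = {}

    def key_for(self, subject_id):
        # Create a key on first use for this subject.
        return self._keys.setdefault(subject_id, secrets.token_bytes(32))

    def shred(self, subject_id):
        # Deleting the key makes every encrypted copy of this subject's
        # PII indecipherable -- including copies in immutable Kafka logs.
        self._keys.pop(subject_id, None)

def _keystream(key, nonce, length):
    # Toy keystream derived from SHA-256; NOT a production cipher.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_pii(store, subject_id, plaintext):
    """Encrypt a PII field before it is published to the log."""
    key = store.key_for(subject_id)
    nonce = secrets.token_bytes(16)
    cipher = bytes(a ^ b for a, b in
                   zip(plaintext, _keystream(key, nonce, len(plaintext))))
    return nonce + cipher

def decrypt_pii(store, subject_id, blob):
    """Decrypt a PII field; fails permanently once the key is shredded."""
    if subject_id not in store._keys:
        raise KeyError("key shredded: data is no longer decipherable")
    key = store._keys[subject_id]
    nonce, cipher = blob[:16], blob[16:]
    return bytes(a ^ b for a, b in
                 zip(cipher, _keystream(key, nonce, len(cipher))))
```

Note that the ciphertext itself never changes: erasure is achieved purely by destroying the key, which is why the approach works even when the stored messages are immutable.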