
Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption And Tracking Data Loss

The logging ingestion infrastructure at Pinterest is built around Apache Kafka. It supports thousands of pipelines, with over 1 trillion new messages (1PB) generated every day by hundreds of services (written in 5 different languages) and transported to the data lake (AWS S3). In the past, we focused on scalability and automated operation of the infrastructure to help internal teams quickly onboard new pipelines (Kafka Summit 2018, 2020). However, we constantly observed data loss and data corruption, a consequence of design decisions that favored scalability and availability over durability and consistency.

To tackle these problems, we designed and implemented a logging auditing framework that consists of (1) an audit client library, integrated into every component of the infrastructure, that detects data corruption in every message and sends out audit events for randomly picked messages, (2) Kafka clusters that receive the audit events, and (3) real-time and batch applications that process the audit events to generate insights for alerting and reporting.
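The audit client in (1) can be illustrated with a minimal sketch. The abstract does not describe how corruption is detected or how messages are sampled, so everything here is an assumption made for illustration: a CRC32 checksum attached when the message is produced, a fixed sampling probability, and the hypothetical names `LoggedMessage`, `AuditEvent`, `wrap`, and `AuditClient`. The `emit` callback stands in for a producer that writes to the audit Kafka clusters in (2).

```python
import random
import zlib
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class LoggedMessage:
    """A payload plus the checksum computed when it was first logged (assumed scheme)."""
    payload: bytes
    checksum: int


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit record; a real one would likely carry more metadata."""
    stage: str
    topic: str
    checksum: int


def wrap(payload: bytes) -> LoggedMessage:
    """Attach a CRC32 checksum at the point where the message is produced."""
    return LoggedMessage(payload, zlib.crc32(payload))


class AuditClient:
    def __init__(self, stage: str, emit: Callable[[AuditEvent], None],
                 sample_rate: float = 0.01,
                 rng: Optional[random.Random] = None) -> None:
        self._stage = stage
        self._emit = emit                # stand-in for a Kafka producer
        self._sample_rate = sample_rate  # assumed fixed sampling probability
        self._rng = rng or random.Random()

    def check(self, topic: str, msg: LoggedMessage) -> bool:
        """Verify every message; emit an audit event for a random sample of valid ones."""
        if zlib.crc32(msg.payload) != msg.checksum:
            return False  # corrupted: the caller can drop or quarantine it
        if self._rng.random() < self._sample_rate:
            self._emit(AuditEvent(self._stage, topic, msg.checksum))
        return True


# Usage: collect events in a list instead of producing to an audit topic.
events: List[AuditEvent] = []
client = AuditClient("ingestion-stage", events.append, sample_rate=1.0)
good = wrap(b"click,user=42")
bad = LoggedMessage(b"click,user=4\x00", good.checksum)  # payload altered in transit
client.check("events", good)  # returns True and records one audit event
client.check("events", bad)   # returns False and records nothing
```

Checking every message while auditing only a sample keeps the corruption check complete and the volume of audit traffic small, which matches the split the abstract describes between per-message detection and sampled audit events.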

Our focus on zero negative impact on existing ingestion pipelines, on scalability, and on cost efficiency drove the design decisions that eventually let us roll out auditing to every pipeline with zero downtime. The framework has fundamentally improved data ingestion quality at Pinterest by tracking data loss and removing data corruption, which in the past could block downstream applications for hours and often led to severe incidents.


Heng Zhang

Heng Zhang is a software engineer focusing on Apache Kafka and Apache Flink at Pinterest. Over the past 2.5 years, he has designed, built, and operated the logging auditing framework to fundamentally improve the ingestion quality of all the pipelines, which in total generate and transport over 1 trillion messages (1PB) every day. He has also led the cross-functional effort to improve the reliability and development velocity of streaming applications written in Apache Flink. Before that, he built data services in Kubernetes on top of Apache Kafka and Apache Solr at VMware. He received his Ph.D. from the University of Maryland, College Park, where he worked on big data and machine learning.