Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption And Tracking Data Loss

« Kafka Summit Americas 2021

Logging ingestion infrastructure at Pinterest is built around Apache Kafka to support thousands of pipelines with over 1 trillion (1PB) new messages generated by hundreds of services (written in 5 different languages) and transported to data lake (AWS S3) every day. In the past, we have focused on scalability and auto operation of the infrastructure to help internal teams quickly onboard new pipelines (Kafka Summit 2018, 2020). However, we had constantly observed data loss and data corruption due to the design decisions we made to favor scalability and availability over durability and consistency.

To tackle these problems, we designed and implemented logging auditing framework which consists of (1) audit client library integrated into every component of the infrastructure to detect data corruption for every message and send out audit events for randomly picked messages, (2) Kafka clusters receiving audit events, and (3) realtime and batch application processing audit events to generate insights for alerting and reporting.

Focusing on zero negative impact to existing ingestion pipelines, scalability and cost efficiency led us to make various design decisions to eventually achieve auditing rollout to every pipeline with zero downtime and fundamentally improve the data ingestion quality at Pinterest in general by tracking data loss and removing data corruption which in the past can block downstream applications for hours and often lead to severe incidents.

Presentador

Heng Zhang

Heng is a software engineer building large scale db / log ingestion and stream processing platforms around Apache Kafka and Apache Flink at Pinterest.

Improving Logging Ingestion Quality At Pinterest: Fighting Data Corruption And Tracking Data Loss

Presentador

Heng Zhang

Related Links

How Confluent Completes Apache Kafka eBook

Leverage a cloud-native service 10x better than Apache Kafka

Confluent Developer Center

Spend less on Kafka with Confluent, come see how