Logging ingestion infrastructure at Pinterest is built around Apache Kafka to support thousands of pipelines, with over 1 trillion new messages (more than 1 PB) generated by hundreds of services (written in 5 different languages) and transported to the data lake (AWS S3) every day. In the past, we focused on scalability and automated operation of the infrastructure to help internal teams quickly onboard new pipelines (Kafka Summit 2018, 2020). However, we constantly observed data loss and data corruption due to design decisions that favored scalability and availability over durability and consistency.
To tackle these problems, we designed and implemented a logging audit framework that consists of (1) an audit client library integrated into every component of the infrastructure to detect data corruption for every message and send out audit events for randomly sampled messages, (2) Kafka clusters that receive the audit events, and (3) realtime and batch applications that process the audit events to generate insights for alerting and reporting. A minimal sketch of such an audit client is shown below.
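The sketch below illustrates the general idea under stated assumptions: every message is checked for corruption via a checksum, while audit events are emitted to a dedicated Kafka cluster only for corrupted messages and a random sample. The class name, audit topic, and event format are hypothetical and do not reflect Pinterest's actual library or API.

```java
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import java.util.zip.CRC32;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * Illustrative audit client: verifies a per-message checksum and emits an
 * audit event for corrupted messages plus a random sample of all messages.
 */
public class AuditClient {
    private static final String AUDIT_TOPIC = "logging_audit_events"; // hypothetical topic name
    private final KafkaProducer<String, String> auditProducer;        // producer to the audit Kafka cluster
    private final double sampleRate;                                  // fraction of messages audited

    public AuditClient(Properties auditClusterProps, double sampleRate) {
        this.auditProducer = new KafkaProducer<>(auditClusterProps);
        this.sampleRate = sampleRate;
    }

    /** Called by a pipeline component for every message it handles. */
    public void audit(String pipeline, String host, byte[] payload, long expectedChecksum) {
        // Corruption check runs on every message.
        CRC32 crc = new CRC32();
        crc.update(payload);
        boolean corrupted = crc.getValue() != expectedChecksum;

        // Audit events are emitted only for corrupted messages and a random
        // sample, keeping overhead on the ingestion path small.
        if (corrupted || ThreadLocalRandom.current().nextDouble() < sampleRate) {
            String event = String.format(
                "{\"pipeline\":\"%s\",\"host\":\"%s\",\"timestamp\":%d,\"corrupted\":%b}",
                pipeline, host, System.currentTimeMillis(), corrupted);
            // Fire-and-forget send: audit failures must never block or fail
            // the primary ingestion path.
            auditProducer.send(new ProducerRecord<>(AUDIT_TOPIC, pipeline, event));
        }
    }
}
```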
Our focus on zero negative impact to existing ingestion pipelines, scalability, and cost efficiency led us to various design decisions that eventually allowed us to roll out auditing to every pipeline with zero downtime. This fundamentally improved data ingestion quality at Pinterest by tracking data loss and eliminating data corruption, which in the past could block downstream applications for hours and often led to severe incidents.