Kafka plays a vital role in our infrastructure at CrowdStrike. We ingest around 5 trillion events per week into our cloud platform, so it is essential that the platform remains available, operable, reliable and maintainable.
In this talk we will explore how we realised that vision of production readiness at scale by categorising the open-source and internal tools we use into four quadrants.
In each quadrant we explore the details of the tools used and highlight the critical roles they play in operating our streaming data infrastructure.
We start with observability, where we will explore Kafka broker and consumer monitoring tooling. We will then look at the auto-remediations we created to improve our availability and operability. Finally, we will explore how we use message tracing to detect message loss and automatically recover lost messages, improving data quality.