View sessions and slides from Kafka Summit London 2018
In this session, we will tell the story of what we built, what went well, what didn’t go so well and what we learnt. This is a story of how a team of developers learnt (and are still learning) how to use Kafka. We hope that you will be able to take away lessons and learnings of how to build a data processing pipeline with Apache Kafka.
"William Hill is one of the UK’s largest, most well-established gaming companies with a global presence across 9 countries with over 16,000 employees. In recent years the gaming industry and in particular sports betting, has been revolutionised by technology. Customers now demand a wide range of events and markets to bet on both pre-game and in-play 24/7. This has driven out a business need to process more data, provide more updates and offer more markets and prices in real time.
At William Hill, we have invested in a completely new trading platform using Apache Kafka. We process vast quantities of data from a variety of feeds, this data is fed through a variety of odds compilation models, before being piped out to UI apps for use by our trading teams to provide events, markets and pricing data out to various end points across the whole of William Hill. We deal with thousands of sporting events, each with sometimes hundreds of betting markets, each market receiving hundreds of updates. This scales up to vast numbers of messages flowing through our system. We have to process, transform and route that data in real time. Using Apache Kafka, we have built a high throughput, low latency pipeline, based on Cloud hosted Microservices. When we started, we were on a steep learning curve with Kafka, Microservices and associated technologies. This led to fast learnings and fast failings.
In the beginning was PIPs, an API backed by a relational database used to store all the BBC’s programme metadata. But as more clients came, with more requirements and ever more complex queries, it became untenable to build one system able to service them all and maintain performance. Each client wanted a simple interface to be able to ask their specific complex questions, about subjects like availability and scheduling.
The Programme Metadata team turned to a combination of Kafka and Clojure (a functional, immutable, Lisp dialect) running in AWS to produce multiple pipelines, one per client requirement. This setup turns the normal ETL pipeline on its head, with one homogenous backend and multiple heterogenous outputs. At each level you can see the same pattern repeated, which extends even into the structure of the Clojure code itself. In this talk we’ll go through some of the things we’ve learned, look at how the structure of Clojure mirrors and supports the way Kafka is used, and see how simple commodity microservices can be reused in multiple pipelines to rapidly satisfy new client requirements.
What are the most important considerations for shipping billions of daily events to analysis? In this session, I’ll share the journey we’ve made building a reliable, near real-time data pipeline. I’ll discuss and compare several data loading techniques and hopefully assist you on making better choices with your next pipeline. MyHeritage collects billions of events every day, including request logs from web servers and backend services, events describing user activities across different platforms, and change-data-capture logs recording every change made in its databases. Delivering these events to analytics is a complex task, requiring a robust and scalable data pipeline. We have decided to ship our events to Apache Kafka, and load them for analysis in Google BigQuery.
In this talk, I’m going to share some of the lessons we learned and best practices we adopted, while describing the following loading techniques:
-Batch loading to Google Cloud Storage and using a load job to deliver data to BigQuery
-Streaming data via BigQuery API, along with Kafka Streams as the streaming framework
-Streaming data to BigQuery with Kafka Connect
-Streaming data with Apache Beam along with its cloud Dataflow runner
Along with presenting our journey, I’ll discuss some important concepts of data loading:
-Batch vs. streaming load
-Processing time partitioning vs. event time partitioning
-Considerations for running your pipeline on premise vs. in the cloud
Hopefully this case study can assist others with building better data pipeline.
At Workday, Kafka is the data backbone powering our search and analytics infrastructure in production. Our customers, the world’s largest companies, rely on our search system for recruitment, employee professional development and other business-critical applications. In addition, enterprise search needs to respect highly configurable security and access-restriction models. As a consequence, our search indices need to be consistent and up-to-date. Our search indexing pipeline processes hundreds of terabytes of data every few hours. In addition, we need a real-time incremental indexing pipeline that captures tens of thousands of transactions a second, processes them and updates indices with relevant data.
To meet these challenges, we built a robust and scalable stream processing engine backed by Kafka to fetch data from our primary data stores, process and index it to our Elasticsearch cluster.
In this talk, we would like to share our experience building a custom streaming engine based on Akka Streams and how we integrated with Kafka. We will talk about how we handle the unique challenges of working with business-critical and sensitive enterprise data. We will describe how we handle per-tenant encryption/decryption, replication, error handling and data loss. We will share our experiences configuring Kafka to handle our workload and partitioning data such that the most critical data is always up-to-date. We will also talk about our experiences with deploying and operating our system in production, describe how we keep this workhorse running at scale and share performance numbers. Last, but not the least, we will touch upon about how our Kafka/Akka streaming engine is used across our product portfolio, powering applications such as collecting user-actions for analytics, training machine-learning models, and ferrying data across microservices for our ML inference engines.
Kafka isn’t just for data engineers or distributed computing enthusiasts. But streaming data doesn’t become useful on its own: it needs to be accessible to the people who will build mobile games, performance dashboards, instant messaging systems (you get the picture) with it. My team set up Etsy’s first Kafka cluster in 2014 — as a pit stop for our clickstream data before it made its way to Hadoop for batch processing. It quickly became one of Etsy’s most reliable systems, as well as its most underutilized. It wasn’t until 2017 that engineers outside the data team began to make use of our Kafka pipeline and the data we had been streaming through it for years.
Now we’re using Kafka to develop user-facing applications and machine learning pipelines. The catalyst for this change was the deployment and monitoring platform we built to make our data sources accessible and usable. This talk will center around the evolution of Kafka at Etsy: how we built out a platform that would empower all of Etsy engineering to explore and experiment with our existing sources of streaming data. I’ll describe how we architected the deployment and monitoring platforms that allow us to support a diverse range of streaming data uses, the benefits and drawbacks of having a centralized platform for multiple services, and how we structured those services for fault isolation. I’ll also talk about how we introduced Kafka and the concepts of streaming architecture and distributed systems to engineers who had never worked with data apps: in other words, how we approached user education. The main takeaways will be technical — tips on architecture and tooling that allow for speed of development as well as functional isolation — and practical — how to support data-driven experimentation in all areas of your organization.
CERN, the European Laboratory for Particle Physics, hosts the biggest particle accelerators in the world and manages, among other resources, large data centres and a distributed computing infrastructure to pursue its scientific goals. Since two years, Apache Kafka is a core part of the new monitoring pipeline which collects, transports and processes several terabytes of metrics and logs per day from more than 30k hosts of the CERN data centres and world-wide grid distributed services. Not only Kafka proved to be the solid backbone of the ingestion infrastructure, it also enabled low-latency on-the-fly access to all monitoring data, opening new possibilities for data extraction, transformation and correlation.
Apache Flume is currently used as collector agent. A processing infrastructure is provided to users for the streaming analytics of monitoring data, based on Mesos/Marathon and Docker for job orchestration and deployment, with users developing the processing logic on the preferred framework (e.g. mainly Apache Spark, but with Kafka Streams/KSQL being an option too). The results of the analysis, as well as the raw data, are stored in HDFS as long term archive and on Elasticsearch and InfluxDB as backends for the visualisation layer, based on Grafana. This talk discusses the monitoring architecture, the challenges encountered in operating and scaling Kafka to handle billions of events per day and presents how users benefit from Kafka as central data hub for stream processing and analysis of monitoring data.
In Ooyala Adtech we have 1B+ ad events per day to be collected, delivered in near real-time to tens of different consumers, which include ad server decision engine itself, forecasting, reporting and more. We used to run it on RabbitMQ and a home-built blob storage, but it was not reliable enough, had high operation costs and made onboarding new data streams difficult. Recently we have completed replacing it with new Kafka-based solution, which we does not have the above mentioned limitations.We have switched all the consumers and changed messaged format to Avro and now we are using quite a lot from Kafka ecosystem: Kafka Streams, Google PubSub connector, S3 Connector, Schema registry and Mirror Maker and soon K-SQL and Debezium. For some of the stateful consumers we had to implement backup/restore of event topics to/from S3.
Some key takeaways:
-Kafka streams was a perfect fit for our environment, where multiple teams develop micro-services independently.
-Using Avro as event format allows for generic processing components (e.g. S3-sink) and schemas are perfect for documentation.
-Kafka time index is a very powerful feature that allowed us to develop clean backup/restore process.
-Using cloud native messaging solution (e.g. Google pub/sub) with a connector to on-premise Kafka cluster appears to be the easiest and cheapest solution for global streaming for us at the moment.
We believe our story of migration to and adoption of Kafka ecosystem might be interesting.
Yelp is composed of thousands of aligned, but autonomous people. Effectively sharing context is vital in large organizations to maintain alignment without sacrificing autonomy. Communicating context around data meaning, ownership, authority, availability, lineage, and quality is critically important in operating large-scale streaming infrastructure. This talk explores how Yelp uses Apache Kafka and managed schemas to answer questions like “What does this column mean?”, “What data is available?”, “What data should I use?”, “Is this data accurate?”, and “How can I get that data?
Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again! Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low latency analytics, event driven architectures and the population of multiple downstream systems. These data pipelines can be built using configuration alone. In this talk, we’ll see how easy it is to stream data from a database such as Oracle into Kafka using the Kafka Connect API. In addition, we’ll use KSQL to filter, aggregate and join it to other data, and then stream this from Kafka out into multiple targets such as Elasticsearch and MySQL. All of this can be accomplished without a single line of code! Why should Java geeks have all the fun?
As Kafka keeps rising in popularity, people are increasingly adopting it as a backbone to store and process large streams of immutable events (Event Streaming as the Source of Truth). Also, we need to maintain integration between new data sources and the existing IT systems, as well as evaluate and use new technologies to analyse the captured events. We can observe this trend by the growing number of Kafka connectors. In this talk we focus on the Kafka Connect architecture and the practicalities around building a connector. Not all data sources are created equal and different scenarios require different approaches.
Having assisted customers develop their own connectors to integrate with a number of different data end points, this talk will dive under the hood and look at:
-Key patterns to handle different loads across the Kafka connect distributed framework.
-Impact of different offset management strategies on message reliability.
-Challenges and pitfalls to be aware of.
Using existing connectors as examples, as well as sample connector code, this pragmatic talk aims to deliver practical insight into this process. If you are a Kafka enthusiast looking to develop a better understanding of existing connectors, or contribute to new ones, this talk is for you.
Dynamic compliance implementation can be a source of competitive advantage for banks. To succeed we need to be fast, efficient, precise and adaptive to future changes. Platform design needs to enable these capabilities. Problem: Starting with a combination of legacy systems and microservices-architecture we were challenged to implement Anti Money Laundering functionality as well as competitive user functionality whilst making the most of limited resources. We wanted to achieve speed, efficiency, compliance and improved agility for future changes.
-Real time decision-analysis of transactions in a Kafka stream.
-The analysis is based on separate data-sources consumed through Kafka topics: (1) Customer images constructed via event sourcing. (2) Transaction history, includes KTables with aggregated transaction -information. (3) Customer risk-evaluation history.
-Result: Successful implementation of AML-compliance on Kafka. On time, cost effective, compliant and highly adaptable for future needs.
-Next steps: We have identified several additional applications of Kafka both to improve compliance controls but also to automate and accelerate customer interactions. The presentation will discuss some of the best cases we see going forward.
KSQL is an open source streaming SQL engine for Apache Kafka. It provides an easy and completely interactive SQL interface for stream processing on Kafka-no need to write any code in a programming language such as Java or Python. As such, KSQL makes stream processing with Kafka significantly more accessible to many more users inside an organization. And while many users may be familiar with traditional SQL technologies such as Oracle or MySQL from the database world, applying SQL to streams of data has several unique and important differences that need to be understood in order to deploy KSQL as a stream processing technology to production environments.
In this talk we take a deep-dive into key internal concepts and architecture of KSQL as a representative of the recently emerging technologies for “streaming SQL”. We discuss the lifecycle of a query from the time it is submitted by the user, to the time it is executed continuously in the KSQL engines 24×7, and until it is terminated. We also explain standalone and distributed client-server deployment options for KSQL along with how KSQL statements are executed in each of these options. Armed with such knowledge, KSQL users will be able to use and operationalize KSQL in the various supported deployment modes much more smoothly and to prevent many mistakes in deployment and maintenance of KSQL.
For a long time the database has been the defacto standard for data, but as architectures grow to include many applications, services and databases, data settles into disparate islands that are increasingly hard to tie together. A problem that is accentuated as business workloads become more data intensive and development cycles shorten. A different approach to this problem is to focus on a central set of event streams. Kafka provides a unique basis for this type of system: it can retain datasets long term, move data quickly to code and provides the tooling needed-Kafka Streams and KSQL-for melding data into the business operations our applications and services perform. In this talk we will walk through how to build such systems. We start with a simple web application built entirely from Kafka Streams. From these simple beginnings we’ll scale the system out with Replicator, Schema Registry and Connect as we evolve the approach out of the micro towards larger, department and company-sized ecosystems.
Centrica Connected Home is one of the largest connected home providers in the UK. It sells the Hive brand of products which include smart lights, sensors and plugs in addition to it's Active Heating system. We build real time analytics and machine learning solutions for our customers using Kafka Streams, Kafka Connect and Spark. Kafka is the backbone of our real time analytics and machine learning platform and our applications are deployed on Kubernetes. We want to present our use of Kafka Streams to build analytics and machine learning solutions and running Kafka Streams on Kubernetes. The talk would go over our reasons for picking Kafka Streams as our preferred streaming solution, experiences with other streaming frameworks, the scalability and reliability Kafka Streams provides us and the kind of problems we solve for our customers.
6point6 worked with a large-scale customer to design and develop a scalable and extensible Shared Data Platform. The immediate goals were to persist and access essential data across multiple client programmes, including migrating from a legacy estate as well as multiple modern systems. Key requirements for this platform were availability, zero data loss and the avoidance of transactions. This talk will cover how Kafka and KStreams were leveraged to deliver a platform built around event sourcing. The solution boasts both high throughput and semi-strong consistency, whilst providing flexibility and the ability for multiple customers to interact with the same record. It will also cover the before and after architectures, the issues experienced, how development teams were impacted, and how platform performance and reliability were improved.
Learn how we:
-Scaled the platform on Amazon Web Services to handle billions of messages.
-Achieved low-latency eventual consistency at 50,000+ messages per second.
-Leveraged a blockchain-like ledger of changes to guarantee data integrity with point-in-time snapshots.
-Provided an atomic and idempotent platform without using distributed transactions.
Come learn how the combination of Apache Kafka and Apache Flink is making stateful stream processing even more expressive and flexible to support applications in streaming that were previously not considered streamable.
The new world of applications and fast data architectures has broken up the database: Raw data persistence comes in the form of event logs, and the state of the world is computed by a stream processor. Apache Kafka provides a strong solution for the event log, while Apache Flink forms a powerful foundation for the computation over the event streams.
In this talk we discuss how Flink’s abstraction and management of application state have evolved over time and how Flink’s snapshot persistence model and Kafka’s log work together to form a base to build ‘versioned applications’. We will also show how end-to-end exactly-once processing works through a smart integration of Kafka’s transactions and Flink’s checkpointing mechanism.
Imagine you must make data-driven decisions in real-time, whether that’s detecting anomalies and fraudulent activities in data feeds, monitoring application behavior and infrastructure, processing CDC information of your databases, or doing real-time ETL. Stream processing is the solution, but unfortunately the world of stream processing still has a very high barrier to entry. Today’s most popular stream processing technologies require the user to write code in programming languages such as Java or Scala. This hard requirement on coding skills is preventing many companies to unlock the benefits of stream processing to their full effect.
In this talk, I introduce the audience to KSQL, the open source streaming SQL engine for Apache Kafka. KSQL provides an easy and completely interactive SQL interface for data processing on Kafka — no need to write any code in a programming language. Instead, all you need is a simple SQL statement such as SELECT * FROM payments-kafka-stream WHERE fraudProbability > 0.8. KSQL brings together the worlds of streams and databases by allowing you to work with your data in a stream and in a table format. Built on top of Kafka’s Streams API, KSQL supports many powerful operations including filtering, transformations, aggregations, joins, windowing, sessionization, and much more. It is open source (Apache 2.0 licensed), distributed, scalable, fault-tolerant, and real-time. You will learn how KSQL makes it easy to get started with a wide range of stream processing use cases such as those described at the beginning. We cover how to get up and running with KSQL, explain its core data abstractions, walk through several use cases, describe its deployment options, and also explore the under-the-hood details of how it all works.
Pyeongchang, South Korea – February 2018. The world is watching the Winter Olympics. Hockey, Skiing, Curling, Figure Skating and Bobsleighing: everything is being streamed online. Gracenote Sports set out to create a Kafka Streams pipeline as a lightning fast backend for Olympic Data widgets, powering a major Olympic broadcaster’s apps and website. Combining streams of real-time results, intermediate times, provisional standings, video logs and athletes’ biographies. For a gold medal in viewing experience. We are going to present an overview of our development using Kafka Streams to enhance our existing rich Olympic model and infrastructure. We will also present some of our experiences and results of running in production during the Olympics.
In this session we present our journey of building a queryable real-time streaming analytics engine using the Kafka Streams API. Keeping reliability and fault tolerance in mind, we created a system, across data centers, to process millions of events per second and to provide real-time insights into user behavior on the internet. Using powerful tools like KSQL we were able to generate intuitive insights into session based metrics like active sessions, current active users, open orders, etc. Further, we funneled all our data into Druid to perform low latency OLAP queries that powers our free-form reporting engine. We also highlight the lessons we learnt along the way and describe certain metrics that were important in insuring the integrity of our Kafka cluster.
Push down filters are nothing new, but with KSQL we can now start to push down our filters for streaming analysis. As the old saying goes, the fastest data to process is data you don’t have to load, and this is especially true in streaming systems. This talk will look proof-of-concept (e.g. code, but not code you should run in production) data sources built to push down filtering operations into KSQL from other stream processing systems. During this talk you will learn more about how Spark Streaming & Apache BEAM data sources work, and the work required to add filter push down in Spark Structured Streaming. While this talk will do its best to avoid summoning Cthulhu, mixing evaluation engines is a delicate exercise.
Every business has a central nervous system through which all information flows and around which all decisions are made. Sometimes this system is ad-hoc and non-standard, resulting in an architecture that is difficult to reason about and even harder to keep running. Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it. How to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business’s central nervous system healthy and humming along, all like a Kafka pro.
As parts of our organization moved to running code in Kubernetes, an abstract container scheduler, we wanted to know if we could follow suit with our Kafka infrastructure. Kafka, though, is not like other workloads, and has its own set of constraints and expectations. This talk will begin with a description of Kubernetes, and then outline the challenges we faced — and solutions we developed — to run this heavily stateful application in this new world. Finally, we will offer our thoughts on whether, in retrospect, it was an experiment worth doing at all.
In this session we will discuss the performance analysis we performed while scaling Kafka in order to achieve a high throughput cluster with a large number of partitions. Decisions like which disk storage options are the most cost effective and which OS tuning knobs we turned to optimize the throughput will be discussed. We will also talk about performance effects as the utilisation of a cluster increases and the practical limits we found. Benchmark test details will be presented as well as caveats encountered during the testing. We hope that his will be helpful to anyone trying to create a large Kafka cluster on the cloud or in the datacenter.
Nowadays users expect instantaneous, real-time interactions with a company’s products and services. Underneath these products and services you will typically find a stream processing architecture with distributed applications and microservices at work to serve all these users. In these modern architectures a key challenge is to ensure smooth and correct operation in spite of machine failures, network outages, and so on. For example, a short outage of a payment gateway should not lead to a payment being lost or sent multiple times.
The most popular streaming platform today is Apache Kafka, which is used by thousands of companies such as Zalando, Netflix, Uber, PayPal, and AirBnB to transform and reshape their industries. Kafka’s rise in popularity has demanded a revisit of its traditional at-least-once processing semantics, where Kafka had thus far guaranteed that, going back to the previous example, a payment would never be lost, but a payment could be processed more than once in the face of failures.
In this talk, we present the recent additions to Apache Kafka 1.0 to achieve what is often referred to as “the holy grail” of stream processing: the support for exactly-once processing semantics. Kafka now guarantees that a payment will be processed exactly once — not more, not less often. We discuss the newly introduced transactional APIs in Kafka, and we will demonstrate at the example of Kafka’s Streams API how these APIs can be leveraged very easily for implementing end-to-end use cases with the new, strong processing guarantees.
At the New York Summit, Ferd Scheepers, Chief Information Architect for ING, took you on a journey during his keynote. In this new session we want to tell you more about this journey and how it relates to Kafka. It all started within ING when a solution was needed to improve our fraud detection system for online banking. The system was expensive, not resilient and hard to scale up to be able to process the ever growing amount of data needed for real-time fraud detection.
Kafka was a perfect solution to solve these issues and once we had our Kafka cluster in production other use cases started pouring in and Kafka became an essential component in our banks data streaming pipeline. Kafka has evolved from a simple proof of concept in 2014 to the so called Event Bus that supports a large number of data streaming use cases.
We will deep dive into some of these use cases where Kafka is used in combination with Nginx, Flink and Cassandra. Besides our current use cases, our future plans will be presented and our global Customer Contact and Notification architecture will be explained. This architecture will enable us to provide our customers with spot on, relevant and actionable insights, based on real-time data.
We will also discuss the challenges of supporting multiple ING countries with their specific requirements and show how our hub ands poke model will help to overcome these challenges. We see Kafka as a great addition to our company. It enables us to improve our customers banking experience and helps us to become a truly event-driven bank.We believe that other companies can learn and benefit from our experiences and the challenges that we have encountered.
My team at Zalando fell in love with KStreams and their programming model straight out of the gate. However, as a small team of developers, building out and supporting our infrastructure while still trying to deliver solutions for our business has not always resulted in a smooth journey. Can a small team of a couple of developers run their own Kafka infrastructure confidently and still spend most of their time developing code? In this talk, we will dive into some of the problems we experienced while running Kafka brokers and Kafka streams applications, as well as the consultations we had with other teams around this matter. We will outline some of the pragmatic decisions we made regarding backups, monitoring and operations to minimize our time spent administering our Kafka brokers and various stream applications.
Testing Kafka sounds easy, especially as there are scripts and guides on how to get the software running within minutes. These make it possible to easily test and demonstrate ‘Kafka 101.’ There are some hard-core tests, for instance using Jepsen to simulate network partitions etc. However, we’ve yet to discover the middle ground that’d enable teams to easily and effectively test key aspects of using Kafka as a platform for their context, environment, and applications. In October 2017 we started testing Kafka’s suitability for a global pan data-centre deployment.
This includes testing:
-Performance and scalability using representative data, configurations, environments and consumers.
-Service operability to assess practical aspects of operating Kafka longer-term including equipment upgrades, reconfigurations, migrations, etc.
-Robustness and FMEDA of Kafka as a service; where networks, nodes, clusters, and clients misbehave, crash, and fail.
Vitally we include human factors and testing standard operating procedures, e.g., how to recover from problems that may occur when using Kafka in production. Fitness-for-purpose: Kafka is one possible solution. How well does it suit the business needs? And from a technical perspective, which Kafka implementation should we use, which API, versions, build processes, etc.? Also, how well does Kafka run, behave and integrate in the client’s environment?
Our approach to testing includes iterating quickly using non-confidential data, equipment and environments outside the client’s domain. The tests are adapted to increase the fidelity and relevance for the client’s domain and performed in their confidential environments. This talk provides a case study of the testing we designed and implemented. Some of the work is being open-sourced to facilitate others to perform similar testing. The audience has the opportunity to learn from our experiences and adapt these to improve their testing and assessment of Kafka and related technologies.
Kafka has become a critical part of most large scale distributed infrastructures. We provide it as a self-service tool where topics are first className citizens. Due to its nature, Kafka is used as a connector between projects of different teams. This presents operations, multi-tenancy, and security challenges in providing a managed service to large organizations. In this talk we will present how we built infrastructure that enables teams to use Kafka while keeping things secure, flexible and maintaining low operational overhead. We’ll talk about our data model, cluster discovery, end-to-end asymmetric encryption framework, and identity management.
A dramatic increase in usage of financial services through internet and digital channels in recent years has blurred the border between online banking and the satisfaction of client needs in the world of IoE. Traditional Core Banking platforms face new challenges such as the volumes of data they need to process in real time, workloads of tens of thousands of transactions per second, 24×7 availability, etc. with which they can no longer cope. Distributed computing technologies unlock web-scale, client-centric architectures for next generation banking platforms that can handle hundreds of thousands of TPS and can support built-in machine learning algorithms and AI. This will be a required foundation to provide best in className financial services based on real-time analytics of client behavior and decision making patterns.
What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker, that will leave you an expert in identifying problems with the least amount of pain:
-Under-replicated Partitions: The mother of all metrics
-Request Latencies: Why your users complain
-Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!