View sessions and slides from Kafka Summit San Francisco 2019
Application teams in JPMC have started shifting towards building event driven architectures and real time steaming pipelines and Kafka has been at core in this journey. As application teams have started adopting Kafka rapidly, need for a centrally managed Kafka as a service has emerged. We have started delivering Kafka as a service in early 2018 and running in production for more than an year now operating 80+ clusters (and growing) in all environments together. One of the key requirements is to provide truly segregated, secured multi-tenant environment with RBAC model while satisfying financial regulations and controls at the same time. Operating clusters at large scale requires scalable self-service capabilities and cluster management orchestration. In this talk we will present - Our experiences in delivering and operating secured, multi-tenant and resilient Kafka clusters at scale. - Internals of our service framework/control plane which enables self-service capabilities for application teams, cluster build/patch orchestration and capacity management capabilities for TSE/admin teams. - Our approach in enabling automated Cross Datacenter failover for application teams using service framework and confluent replicator.
In this presentation, we introduce static membership (KIP-345) and share the story of adopting it at Pinterest. The static membership aims to improve the availability of stream applications, consumer groups and other applications built on top of it. The original rebalance protocol relies on the group coordinator to allocate entity ids to group members. These generated ids are ephemeral and will change when members restart and rejoin. For consumer based apps, this "dynamic membership" can cause a large percentage of tasks re-assigned to different instances during administrative operations such as code deploys, configuration updates and periodic restarts. For large state applications, shuffled tasks need a long time to recover their local states before processing and cause applications to be partially or entirely unavailable. At Pinterest, the group membership is stable between administrative operations. Motivated by this observation, we modified the Kafka's group management protocol allowing group members to provide persistent entity ids. Group membership remains unchanged based on those ids, thus no rebalance will be triggered. We can conveniently leverage Kubernetes or other cloud management frameworks to provide entity ids. By adopting static membership to the realtime infrastructure at Pinterest, applications resume processing only a few seconds after administrative operations finish. Previously with dynamic membership, it can take more than 30 minutes before applications resume. The talk is organized as follows: we first review Kafka's group management protocol and demonstrate high availability use cases that dynamic membership is unable to support. Then we share the design and adoption story of static membership. At the end, we do a live demo to show the impact of static membership.
Uber has one of the largest Kafka deployment in the industry. To improve the scalability and availability, we developed and deployed a novel federated Kafka cluster setup which hides the cluster details from producers/consumers. Users do not need to know which cluster a topic resides and the clients view a "logical cluster". The federation layer will map the clients to the actual physical clusters, and keep the location of the physical cluster transparent from the user. Cluster federation brings us several benefits to support our business growth and ease our daily operation. In particular, Client control. Inside Uber there are a large of applications and clients on Kafka, and it's challenging to migrate a topic with live consumers between clusters. Coordinations with the users are usually needed to shift their traffic to the migrated cluster. Cluster federation enables much control of the clients from the server side by enabling consumer traffic redirection to another physical cluster without restarting the application. Scalability: With federation, the Kafka service can horizontally scale by adding more clusters when a cluster is full. The topics can freely migrate to a new cluster without notifying the users or restarting the clients. Moreover, no matter how many physical clusters we manage per topic type, from the user perspective, they view only one logical cluster. Availability: With a topic replicated to at least two clusters we can tolerate a single cluster failure by redirecting the clients to the secondary cluster without performing a region-failover. This also provides much freedom and alleviates the risks for us to carry out important maintenance on a critical cluster. Before the maintenance, we mark the cluster as a secondary and migrate off the live traffic and consumers. We will present the details of the architecture and several interesting technical challenges we overcame.
When it comes to choosing a distributed streaming platform for real-time data pipelines, everyone knows the answer - Apache Kafka! And when it comes to deploying applications at scale without needing to integrate different pieces of infrastructure yourself, the answer nowadays is increasingly Kubernetes. However, with all great things, the devil is truly in the details. While Kubernetes does provide all the building blocks that are needed, a lot of thought is required to truly create an enterprise-grade Kafka platform that can be used in production. In this technical deep dive, Michael and Viktor will go through challenges and pitfalls of managing Kafka on Kubernetes as well as the goals and lessons learned from the development of the Confluent Operator for Kubernetes. NOTE: This talk together with Michael Ng from Confluent
We have been served well by Zookeeper over the years, but it is time for Kafka to stand on its own. This is a talk on the ongoing effort to replace the use of Zookeeper in Kafka: why we want to do it and how it will work. We will discuss the limitations we have found and how Kafka benefits both in terms of stability and scalability by bringing consensus in house. This effort will not be completed over night, but we will discuss our progress, what work is remaining, and how contributors can help. (Note that I am proposing this as a joint talk with Colin McCabe, who is also a committer on the Apache Kafka project.)
Abstract Summary: While Apache Kafka is designed to be fault-tolerant, there will be times when your Kafka environment just isn't working as expected. Whether it's a newly configured application not processing messages, or an outage in a high-load, mission-critical production environment, it's crucial to get up and running as quickly and safely as possible. IBM has hosted production Kafka environments for several years and has in-depth knowledge of how to diagnose and resolve problems rapidly and accurately to ensure minimal impact to end users. This session will discuss our experiences of how to most effectively collect and understand Kafka diagnostics. We'll talk through using these diagnostics to work out what's gone wrong, and how to recover from a system outage. Using this new-found knowledge, you will be equipped to handle any problem your cluster throws at you.
Running Apache Kafka sometimes presents interesting challenges, especially when operating at scale. In this talk we share some of our experiences operating Apache Kafka as a service across a large company. What happens when you create a lot of partitions and then need to restart brokers? What if you find yourself with a need to reassign almost all partitions in all of your clusters? How do you track progress on large-scale reassignments? How do you make sure that moving data between nodes in a cluster does not impact producers and consumers connected to the cluster? We invite you to dive into a few of the issues we have encountered and share debugging and mitigation strategies.
Upgrades suck. We get it. They are risky and time consuming and you have better things to do. In this talk we'll present good reasons to upgrade anyway and give suggestions on how to de-risk your upgrades. Straight from the team that upgrades Kafka almost every week. We'll review all the releases in the past year - major, minor and bug-fixes. We'll explain the differences between those and what can you expect from each. We'll go into the most important features and most critical fixes and improvements, so you'll have ample ammunition when you explain to your boss why you really have to upgrade Kafka. Then we'll discuss how we validate new releases and suggest a safe upgrade process - because we know that uneventful upgrades are a key to the next upgrade.
Cloud providers like AWS allow free data transfers within an Availability Zone (AZ), but bill users when data moves between AZs. When the data volume streamed through Kafka reaches big data scale, (e.g. numeric data points or user activity tracking), the costs incurred by cross-AZ traffic can add significantly to your monthly cloud spend. Since Kafka serves reads and writes only from leader partitions, for a topic with a replication factor of 3, a message sent through Kafka can cross AZs up to 4 times. Once when a producer produces a message onto broker in a different AZ, two times during Kafka replication, and once more during message consumption. With careful design, we can eliminate the first and last part of the cross AZ traffic. We can also use message compression strategies provided by Kafka to reduce costs during replication. In this talk, we will discuss the architectural choices that allow us to ensure a Kafka message is produced and consumed within a single AZ, as well as an algorithm that lets consumers intelligently subscribe to partitions with leaders in the same AZ. We will also cover use cases in which cross-AZ message streaming is unavoidable due to design limitations. Talk outline: 1) A review of Kafka replication, 2) Cross-AZ traffic implications, 3) Architectural choices for AZ-aware message streaming, 4) Algorithms for AZ-aware producers and consumers, 5) Results, 6) Limitations, 7) Takeaways.
Over the years at Yelp, we have relied on Kafka to build many complex applications and stream processing data-pipelines that solve a multitude of use cases, including powering our product experimentation workflow, search indexing, asynchronous task processing and more. Today, Kafka is at the core of our infrastructure. These applications use different versions of Kafka clients and different programming languages.To fulfill the requirements of these diverse use cases, we run several specialized Kafka clusters for high-availability, consistency, exactly-once and infinite retention. We endeavor to keep our clusters up-to-date with newer Kafka versions that bring with them several critical bug fixes and exciting features like dynamic broker configuration, exactly-once semantics, kafka offset management and improved tooling. Our journey with Kafka started with version 0.8.2.0. Upgrading Kafka while ensuring client compatibility, zero-downtime, negligible performance degradation across our ever-growing multi-regional cluster deployment exposed us to a plethora of unique challenges. This session will focus on the challenges we encountered and how we evolved our infrastructure tooling and upgrade strategy to overcome them. I will be talking about: -- How we rolled out new features such as kafka offset storage, message timestamp, reassignment auto-throttling, etc. -- Core technical issues discovered during upgrades such as failure of log cleaners due to large offsets while upgrading. -- The in-house test-suite that we built in order to: validate new kafka versions against our existing tooling and client-libraries, exercise the upgrade and rollback process and benchmark performance. -- The automation we built for safe and fast rolling upgrades and broker configuration deployment.
While many companies are embracing Apache Kafka as their core event streaming platform they may still have events they want to unlock in other systems. Kafka Connect provides a common API for developers to do just that and the number of open-source connectors available is growing rapidly. The IBM MQ sink and source connectors allow you to flow messages between your Apache Kafka cluster and your IBM MQ queues. In this session we will share our lessons learned and top tips for building a Kafka Connect connector. We'll explain how a connector is structured, how the framework calls it and some of the things to consider when providing configuration options. The more Kafka Connect connectors the community creates the better, as it will enable everyone to unlock the events in their existing systems.
We use machine learning to delve deep into the internals of how systems like Kafka work. In this talk I'll dive into what variables affect performance and reliability, including previously unknown leading indicators of major performance problems, failure conditions and how to tune for specific use cases. I'll cover some of the specific methodology we use, including Bayesian optimization, and reinforcement learning. I'll also talk about our own internal infrastructure that makes heavy use of Kafka and Kubernetes to deliver real-time predictions to our customers.
Getting Kafka running on Kubernetes is only step one of a journey to create a production-ready Kafka cluster. This talk walks through the other steps: 1) Monitoring and remediating faults. 2) Updates to Kubernetes nodes for clusters not using shared storage. 3) Automating Kafka updates and restarts. We present how to create fault-tolerant Kafka clusters on Kubernetes without sacrificing availability, durability, or latency. Learn about Lyft's overlay-free Kubernetes networking driver and how we use it to keep performance on par with non-Kubernetes clusters.
In this baller talk, we will be addressing the elephant in the room that no one ever wants to look at or talk about: security. We generally never want to talk about configuring security because if we do, we allocate risk of penetration by exposing ourselves to exploitation. However, this leads to a lot of confusion around proper Kafka security best practices and how to appropriately lock down a cluster when you are starting out. In this talk we will demystify the elephant in the room without deconstructing it limb by limb. We will give you a notion of how to configure the following for BOTH clients and servers: * TLS or Kerberos Authentication * Encrypt your network traffic via TLS * Perform authorization via access control lists (ACLs) We will also demonstrate the above with a GitHub repo you can try out for yourself. Lastly, we will present a reference implementation of oauth if that suits your fancy. All in all you should walk away with a pretty decent understanding of the necessary aspects required for a secure Kafka environment.
At Stitch Fix, we maintain a distributed Kafka Connect cluster running several hundred connectors. Over the years, we've learned invaluable lessons for keeping our connectors going 24/7. As many conference goers probably know, event driven applications require a new way of thinking. With this new paradigm comes unique operational considerations, which I will delve into. Specifically, this talk will be an overview of: 1) Our deployment model and use case (we have a large distributed Kafka Connect cluster that powers a self-service data integration platform tailored to the needs of our Data Scientists). 2) Our favorite operational tools that we have built for making things run smoothly (the jobs, alerts and dashboards we find most useful. A quick run down of the admin service we wrote that sits on top of Kafka Connect). 3) Our approach to end-to-end integrity monitoring (our tracer bullet system that we built to constantly monitor all our sources and sinks). 4) Lessons learned from production issues and painful migrations (why, oh why did we not use schemas from the beginning?? Pausing connectors doesn't do what you think it does... rebalancing is tricky... jar hell problems are a thing of the past, upgrade and use plugin.path!). 5) Future areas of improvement. The target audience member is an engineer who is curious about Kafka Connect or currently maintains a small to medium sized Kafka Connect cluster. They should walk away from the talk with increased confidence in using and maintaining a large Kafka Connect cluster, and should be armed with the hard won experiences of our team. For the most part, we've been very happy with our Kafka Connect powered data integration platform, and we'd love to share our lessons learned with the community in order to drive adoption.
Apache Kafka is an amazing piece of technology, that has been furiously adopted by companies all around the world to implement event-driven architectures. While its adoption continues to increase; the reality is that most developers often complain about the complexity of managing the clusters by themselves, which seriously decreases their ability to be agile. This talk will introduce Confluent Cloud, a service that offers Apache Kafka and the Confluent Platform so developers can focus on what they do best: the coding part.
Through interactive demos, it will be shown how to quickly reuse code written for standard Kafka APIs to connect to Confluent Cloud and doing some interesting stuff with it. This is a zero-experience-needed type of session, where the focus is on providing the first steps to beginners.
With the rising adoption of stream- and event-driven processing microservice architectures are becoming more and more complex. One challenge that many businesses face during the initial and ongoign development of these solutions is how to properly model and maintain dependencies between microservices. One specific example for this that will be used throughout this talk cleansing and enrichment of data that has been ingested into a streaming platform. For most use cases there are a lot of minor tasks that need to be performed on every piece of data before it is fully usable for processing. Some common examples are: normalize phone numbers, normalize street addresses, geocode addresses, lookup customer data and enrich record, ... Most of these tasks are completely independent of each other, but some have dependencies to be run before or after other tasks - geocoding, for example, should be done only after address normalization has finished. Defining and orchestrating a complex graph of these operations is no small feat. This talk will focus on outlining the requirements and challenges that one needs to solve when trying to implement a flexible framework for solving this use case. It will then build on these requirements and present the blueprint of a generic solution and show how Kafka and Kafka Streams are a perfect fit to address and overcome most challenges. This talk, while offering some technical details is mostly targeted at people at the architecture, rather than the code, level. Listeners will gain a thorough understanding of the challenges that stream processing offers but also be provided with generic patterns that can be used to solve these challenges in their specific infrastructure.
Early attempts at real-time business event streaming at Kroger was based on JSON formatted events. Modifications to the event formats occasionally broke downstream consumers, causing costly downtime. In the course of reimagining what an industrial strength streaming platform would look like, we decided to focus heavily on schema lifecycle and management as a foundation. The schema registry is a great service, but it's only one part of the schema lifecycle management process. Here are the core principles around schema management: (1) Event schema are expressed in Avro (2) New versions will be fully compatible with older versions (3) Event producers create, manage, and fully document event schemas (4) Avro Schemas are managed in git and represent the source of truth (5) Complex schemas can be broken into smaller reusable component schemas and referenced in larger schemas The CI/CD Build process, in conjunction with customized gradle plugins, perform the following: (1) Constructs the full event schemas from components into larger registerable schemas (2) Generates Java source code based on the event schemas (3) Checks compatibility with prior registered versions (4) Registers the new/updated version in the schema registry (5) Publishes generated JAR File into artifactory for producers & consumers (6) Other Source Code Generation (Future) (7) Publishes the schema into other metadata tools to help make them more discoverable (future)
Invitae is one of the fastest growing genetic information companies, whose mission is to bring comprehensive genetic information into mainstream medical practice to improve the quality of healthcare for billions of people. We have recently partnered with another lab, requiring an integration layer that was developed as part of a dizzying leap from a traditional Python service architecture to Scala Streaming applications on Kafka and Kubernetes. This presentation is our story, where we discuss challenges and solutions, error handling and resilience techniques, technology stack choices and compromises, tools and approaches we have developed, and general insights. Beyond engineering itself, our team's goal is enabling others to join in. Building an application entirely of Streams is a significant and in many ways liberating paradigm shift. In addition to learning to architect and understand how the application will behave and evolve, success depends on great tooling. We will show, for example, how we extended KStreams API to seamlessly include Avro Schema as part of our build and code infrastructure, completely automating SerDe derivation, introducing typed topics, and still supporting polyglot teams. Other highlights: - Self-healing streams with aggregation, and deciding when to crash - Connectors vs Streams for side effects - Scheduling with Streams - Deriving topology diagrams - Monitoring and metrics as Streams - Combining Avro, Swagger and code generation, plus avro4s vs avrohugger comparison - Typelevel Cats and its role in our success - http4s and hybrid testing
Using Kafka to Discover Events Hidden in your Database Anna McDonald Principal Software Developer, SAS Institute Is your RDBMS the center of everything? Do you have multiple applications, batch jobs, and direct consumers hitting your RDBMS? In order to cover everywhere an event takes place would you need to update multiple services? Moving to event-driven development can seem overwhelming when faced with this scenario; that's where Kafka comes in! Using Change Data Capture (CDC) and Kafka Streams gives you access to every single event in your space-if you only know where to look. In this talk we'll discuss: - The properties in a CDC message used in identifying events - Techniques for defining high-level events and how to sniff them out using Kafka Streams - How to define predicates for CDC events - Recommendations on handling events that require more than one table We'll wrap up by discussing how to structure your event schema to handle a mix of data derived and more traditionally produced events. Learn to be a certified event investigator FTW!
Hotstar is a media entertainment platform, and has a large user base in India. Last year during the Indian Premier League, Hotstar introduced "Watch N Play", a real-time cricket prediction game in which over 33 million unique users answered over 2 billion questions, won more than a 100 million rewards, built with Kafka as the backbone. In the game, the user guesses the outcome of the next ball. If he/she guesses right before the actual outcome, they score points to climb up the ladder and receive rewards along the way. Supporting potentially millions of users with differing stream times & device latencies; we used topics to separate logical streams & partitions to scale to support 1M requests/second. We'll talk about how we reduced the -end latency to make the user-experience as real-time as possible. We'll focus on the operational challenges that we faced and how we overcame them by building automation around it to make our operational cum lives easier. We also have a war story around ghost topic creation where after deletion; topics would get magically created without any create topic requests. This led us to some interesting revelations and very important learnings of how long-living producers and consumers with very short-lived topics are a recipe for disaster.
Apache Kafka is changing the way we build scalable and highly available software systems. Providing a simplified path to eventual consistency and event sourcing Kafka gives us the platform to make these patterns a reality for a much broader segment of applications and customers than was possible in the past. Cloud Events is an interoperable specification for eventing that is part of the CNCF. This session will combine open source and open standards to show you how you can build highly reliable application that scale linearly, provide interoperability and are easily extensible leveraging both push and pull semantics. Concrete real world examples will be shown of how Kafka makes event sourcing more approachable and how streams and events complement each other including the difference between business events and technical events.
Have you ever tried to debug a production outage, when your system comprises apps your team has written, third-party apps your team runs, with logs going into some system, application performance metrics going into another system, and cloud platform metrics going somewhere else? Did you find yourself switching tabs, trying to correlate metrics with logs and alerts and finding yourself in a huge tangle? It is a nightmare. In the data world, we talk about aggregating all our data so we can derive new insights quickly, but what about our operational data? Observability is your ability to be able to ask questions of your system without having to write new code, or grab new data. When you've got an observable system, it feels like you have debugging superpowers, but can be challenging to even know where to start. If you can even convince your colleagues to start, finding the right tools can be challenging. In this talk Inny and Andrew will talk about what monitoring and logging are not sufficient anymore (if they ever were), observability basics, and demo an observability platform that you can use to start your observability journey today.
In this talk we'll look at the relationship between three of the most disruptive software engineering paradigms: event sourcing, stream processing and serverless. We'll debunk some of the myths around event sourcing. We'll look at the inevitability of event-driven programming in the serverless space and we'll see how stream processing links these two concepts together with a single 'database for events'. As the story unfolds we'll dive into some use cases, examine the practicalities of each approach-particularly the stateful elements-and finally extrapolate how their future relationship is likely to unfold. Key takeaways include: The different flavors of event sourcing and where their value lies. The difference between stream processing at application- and infrastructure-levels. The relationship between stream processors and serverless functions. The practical limits of storing data in Kafka and stream processors like KSQL.
Event-based stream processing is a modern paradigm to continuously process incoming data feeds, e.g. for IoT sensor analytics, payment and fraud detection, or logistics. Machine Learning / Deep Learning models can be leveraged in different ways to do predictions and improve the business processes. Either analytic models are deployed natively in the application or they are hosted in a remote model server. In the latter you combine stream processing with RPC / Request-Response paradigm instead of direct doing direct inference within the application. This talk discusses the pros and cons of both approaches and shows examples of stream processing vs. RPC model serving using Kubernetes, Apache Kafka, Kafka Streams, gRPC and TensorFlow Serving. The trade-offs of using a public cloud service like AWS or GCP for model deployment are also discussed and compared to local hosting for offline predictions directly "at the edge".
In our global economy, businesses must be nimble and often have to adapt quickly. As a result, many businesses structure their teams in an Agile way to keep up with this demand. With Domain Driven Design, it's possible to quickly modify applications to accommodate changing business needs and easily integrate with disparate third-party systems. But what if you were able to use historical data and analytics to enhance your applications capabilities? You may not always have all the necessary information in your CRUD database. With Event Sourcing, you are able to store new application events as well as existing events resulting in more robust applications. This makes it possible to change your application in ways you have never imagined before! This talk will show you how you can design your Event Sourcing based application with Spring Boot and Apache Kafka.
Kafka Streams and Kafka Connect provide tools to consume and produce from Kafka, but, as services are built out, how do we know how well our system is doing? Is our AMQP broker introducing increasing round-trip time? How many messages are being consumed by our application and where do these messages come from? We could add counters with a Kafka Streams processes, but how would we do the same for Kafka Connectors? Many of these problems can be solved by leveraging interceptors, Prometheus, and OpenTracing. By using interceptors we can quickly instrument new applications and connectors to provide observability into your entire stack. By the end of this talk, you'll learn how to utilize interceptors throughout your Kafka Stream applications and Kafka Connect cluster to provide additional observability. We'll demonstrate how you can use these interceptors to provide both tracing using OpenTracing and monitoring using Prometheus. Additionally, we'll highlight the pros and cons of using interceptors for observability compared to existing methods. As an attendee of Kafka Summit 2018 at SF, I gained insight on how other companies use Kafka with IoT devices and I look forward to the opportunity to share the knowledge our team has gained since attending the summit. Thank you for your consideration.
Is your organization adopting Kafka as their messaging bus but you've found that it will take too long to migrate your existing service-oriented architecture to a log-oriented architecture? Some of the biggest challenges in building a new stream processor can be implementing all the business logic again. It has become increasingly common for companies with high-throughput source streams and change-data-capture logs to want to build systems fast. At Ticketmaster, we have found a solution to the problem by leveraging the business logic in our existing services and calling them from our Java based KafkaStreams processor applications in an efficient manner. In this talk, we will examine the initial challenges we faced in our transition, then we will explore the solutions we built to address the use cases at Ticketmaster. The primary focus will address our workflow around calling services to bring stream processor applications to market fast. We will review our challenges and share tips for success.
Ever wondered just how many CPU cores of KSQL Server you need to provision to handle your planned stream processing workload ? Or how many GBits of aggregate network bandwidth, spread across some number of processing threads, you'll need to deal with combined peak throughput of multiple queries ? In this talk we'll first explore the basic drivers of KSQL throughput and hardware requirements, building up to more advanced query plan analysis and capacity-planning techniques, and review some real-world testing results along the way. Finally we will recap how and what to monitor to know you got it right!
Apache Kafka's Streams API lets us process messages from different topics with very low latency. Messages may have different formats, schemas and may even be serialised in different ways. What happens when an undesirable message comes in the flow? When an error occurs, real-time applications can't always wait for manual recovery and need to handle such failures. Kafka Streams lets you use a few techniques like sentinel value or dead letter queues-in this talk we'll see how. This talk will give an overview of different patterns and tools available in the Streams DSL API to deal with corrupted messages. Based on a real-life use case, it also includes valuable experiences from building and running Kafka Streams projects in production. The talk includes live coding and demonstrations.
Food processing fits trivially to stream processing when it comes to a digital twin. Today, the food processing industry is not well connected but there are huge potentials if data is getting shared. For example, the IoT (Internet of Things) data at the very first stage of the value chain has an impact on downstream steps. Integrating data of the food value chain can be easily done by applying streaming processing using Apache Kafka and in particular KSQL. In this talk, we will deep dive into how we stream data of the quality of fish and poultry collected from a factory in real-time and how KSQL plays a significant role. Since the digitization journey of BAADER is not only to establish powerful tools rather than establishing a new kind of culture where Apache Kafka helps in general. Beyond collecting product data, streaming machine data becomes crucial as well when state-of-the-art predictive services are provided. However, Apache Kafka with the capability of strictly ordered of messages allows us precisely analyze machine data at any point in time by just moving the offset slider.
KSQL is the streaming SQL engine for Apache Kafka. It provides an easy and completely interactive SQL interface for stream processing on Kafka. Users can express their processing logic in SQL like statements and KSQL will compile and execute them as Kafka Streams applications. Although KSQL provides a rich set of features and built in functions, many use cases require more domain specific processing logic that cannot be expressed in pure SQL. To enable users to use KSQL in such scenarios, KSQL provides a framework to define complex processing logic as User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs). In this talk, we provide a deep dive into the UDF/UDAF framework in KSQL. We explain how users can define their custom UDFs/UDAFs and use them in their queries. We also describe how KSQL utilizes the provided UDFs/UDAFs under the hood to process streams and tables. This deep dive will include an insight into how UDFs process data and how UDAFs keep track of their state. Armed with such knowledge, KSQL users will be able to define and utilize complex data processing logic in their KSQL queries. They will also be able to diagnose and fix issues in defining and deploying their UDFs/UDAFs more efficiently.
Since its initial release, the Kafka group membership protocol has offered Connect, Streams and Consumer applications an ingenious and robust way to balance resources among distributed processes. The process of rebalancing, as it's widely known, allows Kafka APIs to define an embedded protocol for load balancing within the group membership protocol itself. Until now, rebalancing has been working under the simple assumption that every time a new group generation is created, the members join after first releasing all of their resources, getting a whole new load assignment by the time the new group is formed. This allows Kafka APIs to provide task fault-tolerance and elasticity on top of the group membership protocol. However, due to its side-effects on multi-tenancy and scalability this simple approach in rebalancing, also known as stop-the-world effect, is limiting larger scale deployments. Because of stop-the-world, application tasks get interrupted only for most of them to receive the same resources after rebalancing. In this technical deep dive, I'll discuss the proposition of Incremental Cooperative Rebalancing as a way to alleviate stop-the-world and optimize rebalancing in Kafka APIs. We'll cover: * The internals of Incremental Cooperative Rebalancing * Uses cases that benefit from Incremental Cooperative Rebalancing * Implementation in Kafka Connect * Performance results in Kafka Connect clusters
Robin is a Developer Advocate at Confluent, the company founded by the original creators of Apache Kafka, as well as an Oracle Groundbreaker Ambassador. His career has always involved data, from the old worlds of COBOL and DB2, through the worlds of Oracle and Hadoop, and into the current world with Kafka. His particular interests are analytics, systems architecture, performance testing and optimization. He blogs at http://cnfl.io/rmoff and http://rmoff.net/ and can be found tweeting grumpy geek thoughts as @rmoff. Outside of work he enjoys drinking good beer and eating fried breakfasts, although generally not at the same time.
Streams and Tables are the foundation of event streaming with Kafka, and they power nearly every conceivable use case, from payment processing to change data capture, from streaming ETL to real-time alerting for connected cars, and even the lowly WordCount example. Tables are something that most of us are familiar with from the world of databases, whereas Streams are a rather new concept. Trying to leverage Kafka without understanding tables and streams is like building a rocket ship without understanding the laws of physics-a mission bound to fail. In this session for developers, operators, and architects alike we take a deep dive into these two fundamental primitives of Kafka's data model. We discuss how streams and tables incl. global tables relate to each other and to topics, partitioning, compaction, serialization (Kafka's storage layer), and how they interplay to process data, react to data changes, and manage state in an elastic, scalable, fault-tolerant manner (Kafka's compute layer). Developers will understand better how to use streams and tables to build event-driven applications with Kafka Streams and KSQL, and we answer questions such as "How can I query my tables?" and "What is data co-partitioning, and how does it affect my join?". Operators will better understand how these applications will run in production, with questions such as "How do I scale my application?" and "When my application crashes, how will it recover its state?". At a higher level, we will explore how Kafka uses streams and tables to turn the Database inside-out and put it back together.
The last 5 years, Kafka and Flink have become mature technologies that have allowed us to embrace the streaming paradigm. You can bet on them to build reliable and efficient applications. They are active projects backed by companies using them in production. They have a good community contributing, and sharing experience and knowledge. Kafka and Flink are solid choices if you want to build a data platform that your data scientists or developers can use to collect, process, and distribute data. You can put together Kafka Connect, Kafka, Schema Registry, and Flink. First, you will take care of their deployment. Then, for each case, you will setup each part, and of course develop the Flink job so it can integrate easily with the rest. Looks like a challenging but exciting project, isn't it? In this session, you will learn how you can build such data platform, what are the nitty-gritty of each part, how you can plug them together, in particular how to plug Flink in the Kafka ecosystem, what are the common pitfalls to avoid, and what it requires to be deployed on kubernetes. Even if you are not familiar with all the technologies, there will be enough introduction so you can follow. Come and learn how we can actually cross the streams!
When Funding Circle needed to scale its lending platform, we chose Kafka and Clojure. More than a programming language, Clojure is an interactive development environment with which you can build up an application function by function in a continuous unbroken flow. Since 2016 we have been developing our lending platform using Clojure and Kafka Streams, and today we process millions of transaction dollars daily. In 2018 we released "Jackdaw", our open-source Clojure library for working with Kafka Streams. In this talk, attendees will learn a radical new approach to building stream processing applications in a highly productive environment--one they can use immediately via Jackdaw or apply to their favorite programming system.
Let's talk about risks and pricing in insurance: From an insurance company the customer expects a fair (and affordable) tariff. How can we offer this, especially if the tariff model is very static? With KSQL, we are building the entire Processing Piplines directly in Kafka. With each deal we can re-evaluate the overall risk and learn from each claim. With each quote request, we understand the market better. And with this knowledge, we can adjust prices in real time to keep it cheap for the customer and still make some money. We expect peaks with twenty requests per second in Q4 and our partners allows us only one second to stick a price tag on the quote. Therefore we need a system that is fast, scalable and reliable. The central point is Confluent Kafka with a heavy use of streaming processing with KSQL. We start with the insurance product on 01.10. in the German market and look exclusively at architecture and function.
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, ie, the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, ie, Kafka Streams and KSQL, are even richer, more powerful, but also more complicated. Hence, it is paramount for developers, to understand different time semantics and to know how to configure Kafka to achieve them. Therefore, this talk enables developers to design applications with their desired time semantics, help them to reason about the runtime behavior with regard to time, and allow them to understand processing/query results.
A Kafka cluster stores streams of records (messages) in categories called topics. It is the architectural backbone for integrating streaming data with a Data Lake, Microservices and Stream Processing. Today's enterprises have their core systems often implemented on top of relational databases, such as the Oracle RDBMS. Implementing a new solution supporting the digital strategy using Kafka and the ecosystem can not always be done completely separate from the traditional legacy solutions. Often streaming data has to be enriched with state data which is held in an RDBMS of a legacy application. It's important to cache this data in the stream processing solution, so that It can be efficiently joined to the data stream. But how do we make sure that the cache is kept up-to-date, if the source data changes? We can either poll for changes from Kafka using Kafka Connect or let the RDBMS push the data changes to Kafka. But what about writing data back to the legacy application, i.e. an anomaly is detected inside the stream processing solution which should trigger an action inside the legacy application. Using Kafka Connect we can write to a database table or view, which could trigger the action. But this not always the best option. If you have an Oracle RDBMS, there are many other ways to integrate the database with Kafka, such as Advanced Queueing (message broker in the database), CDC through Golden Gate or Debezium, Oracle REST Database Service (ORDS) and more. In this session, we present various blueprints for integrating an Oracle RDBMS with Apache Kafka in both directions and discuss how these blueprints can be implemented using the products mentioned before.
Rakuten Card, being the No 1 credit card issuer in Japan issues a new card every 8 seconds. The journey of customer application data starts from the time the user applies for a card and is used across each and every component of a Credit card system and remains even after a card has expired or a user no longer has need of it. This critical application data is used by scoring systems, fraud monitors, card printing companies, logistics, debt management, credit limit management, authorization and settlement systems. To date, the only way to access this application data has been via legacy scheduled batching. This legacy methodology not only blocks innovation but also has low maintainability, high operation cost, and time-consuming recovery of missing or corrupt data. At Rakuten Card, these complexities were removed by moving to a central event hub with Apache Kafka. This allows for real-time processing (collecting / transforming / delivering / analyzing) and the sharing of this data to different systems or databases in a vastly new way that reduces complexity and speeds time from recovery of any problems. What you'll learn at this talk: How Rakuten Card has removed blockers such as data loss and downtime How we created a fault tolerant, failure proof hub built using Apache Kafka Our strategic use of auto balanced Kubernetes cluster to manage Apache Kafka What we did to streamline the retrieval of application data How Rakuten continues to drive innovation in a traditionally conservative Japanese tech culture.
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classNameically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database. In this talk I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for Scalable payment processing Run it on rails: Instrumentation and monitoring Control flow patterns (start, stop, pause) Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and most importantly, how it all fits together at scale.
Key Takeaways => Use of techniques to services decomposition into a set of stages allowing code modularity and reuse. Good practices for dealing with DeadLetter, Monitoring, CorrelationID, Log, Base classNamees to control all software development best practices, Buffer Control in Apache Kafka and aspects related to Apache Kafka scalability and fault tolerance. Processing and management of high messages streaming on Black Friday (~ 25.4 million / day) Context => After a retrospective of how our structure behaved during the last Black Friday, we learned a few lessons and decided to adopt a new approach to address some specific scenarios which have millions of messages, ensuring resilience, uptime of at least 99.9%, monitoring and alerts for each module. We decided to adopt the SEDA architecture standard to traffic these millions of messages as closely as possible and deliver the desired quality to the target systems with scalability and reliability. By separating the pipeline processing modules, we were able to scale each of these modules horizontally, increasing the number of PODs (Openshift) and partitions of Kafka topics in order to process a given pipeline step faster. In addition, we also need to apply tunnings to Apache Kafka, one of which concerns the guarantee of delivery of the message. The focus of this presentation is to show the solution designed and how we use Apache Kafka and the SEDA architecture standard to orchestrate this massive stream of data we face. | gdocs url => https://tinyurl.com/seda-via-varejo
This is a story about what happens when a distributed system becomes a big part of a small team's infrastructure. This distributed system was Kafka and the team size was one engineer. I will discuss my failures along with my journey of deploying Kafka at scale with very little prior distributed systems experience. In this presentation, we will discuss how unique insights in the following organization culture, engineering and metrics created tailwinds and headwinds. This presentation will be a tactical approach to conquering a complex system with an understaffed team while your business is growing fast. I will discuss how the use case and resilience requirements for our Kafka cluster change as the user base grew from 100K users to over 6 million.
Namely is a late-stage startup that builds HR, Payroll and Benefits software for mid-sized businesses. Over the years, we've ended up with a number of monolithic and legacy applications covering overlapping domain concepts, which has limited our ability to deliver new and innovative features to our customers. We need a way to get our data out of the monoliths to decouple our systems and increase our velocity. We've chosen Kafka as our way to liberate our data in a reliable, scalable and maintainable way. This talk covers specific examples of successes and missteps in our move to Kafka as the backbone of our architecture. It then looks to the future - where we are trying to go, and how we plan on getting, both from the short term and long term perspectives. Key Takeaways: - Successful and unsuccessful approaches to gradually introducing Kafka to a large organization in a way that meets the short and long term needs of the business. - Successful and unsuccessful patterns for using Kafka. - Pragmaticism versus purisim: Building Kafka-first systems, and migrating legacy systems to Kafka with Debezium. - Combining event driven systems with RPC based systems. Observability, alerting and testing. - Actionable steps that you can take to your organization to help drive adoption.
Recursion Pharmaceuticals is turning drug discovery into a data science problem. This entails producing and processing petabytes of microscopy images from carefully designed biological experiments. In early 2017 the data production effort in our laboratory scaled to a point where the existing naive batch processing system was not reliably processing the data. The batch approach was also introducing unwanted lag between experiment image capture time and analysis results since an entire experiment, potentially 8TB+, would not begin processing until all the images were available. This was particularly troublesome for our laboratory as they wanted real time quality control metrics on the images. All of these reasons motivated us to replace the batch processing system with a streaming approach. The original data pipeline was implemented as microservices with no central orchestrator but instead relied on implicit flow between the services. The lack of visibility and robustness made the pipeline difficult and costly to operate. We wanted to address these concerns but also avoid rewriting the existing microservices. By building on top of Kafka Streams we created a flexible, highly available, and robust pipeline which leveraged our existing microservices giving us a clear migration path. This presentation will walk you through our thought process and explain the tradeoffs between using Kafka Streams and Spark for our specific use case. We'll dive into the details of the workflow system we created on top of Kafka Streams that orchestrates these microservices. We've been operating with this system since mid 2017 and the additional scale and robustness has played a key role in enabling Recursion to succeed in its mission of discovering new treatments for various diseases. The messages flowing over our Kafka Streams have already led to clinical trials in humans and will hopefully translate into meaningful impact in patients lives one day.
The field of astronomy is rapidly changing away from the traditional notion of a lone astronomer pointing a telescope at a single object in a static sky. Initiatives such as the Sloan Digital Sky Survey have ushered in a collaborative big data era of wide-field sky surveys, in which telescopes collect observations continuously while sweeping across the visible night sky. This method of data collection enables not only very deep imaging of far and faint objects but is also optimal for searching for objects that might be changing or moving. By analyzing the differences in astronomical image data from one night to the next, astronomers can detect "transient" objects, such as variable stars, supernova, and near Earth asteroids. New sky surveys provide a wealth of scientific value for astronomers but not without technical challenges. Survey data need to be automatically processed and the results immediately distributed to the scientific community in order to enable rapid follow-up observations as transient astronomy can be highly time sensitive. Detection alert data distribution mechanisms need to be robust and reliable to maintain scientific integrity without data loss. Additionally, alerting systems need to be scalable to support a data volume unprecedented in astronomy, as transient detection rates have increased to exceed all historical data in a single night. A streaming architecture is an ideal architecture for automated distribution and processing of transient data in real time as it is being collected. In this talk, we will discuss how Kafka and Avro are being used in wide-field astronomical sky survey pipelines to serialize and distribute transient data, the design choices behind this system, and how this alert stream system has been successfully deployed in production to distribute transient detection alerts to the scientific research community in excess of 1 million events per night.
Synopsis How often were you told that you have to stream data at scale, process and analyze them to make real time decision making without losing a single event? How often are you told that the scale of the data we are talking about is in several billions and the cost of one message can go up to 10s of 1000s of dollars? How often did you have to deal with the challenge of doing real-time decision making, analytics, ML and auditing with data in motion leave apart the data at rest? Real Time Inventory and Replenishment System We have the requirement to develop a system that will enable us to do real time tracking of items moving within the supply chain as it is vital for making quicker replenishment decisions and other real-time use cases. To fulfill this requirement, we chose to build an event-driven system that will track this inventory information and create plans and orders in near real time. We have events at the heart of the systems. Through this journey to meet scale with reliability, we learnt a lot of lessons to leverage Kafka at scale and various optimized ways to produce and consume from Kafka. We look forward to meet you all and discuss in detail our journey and connect you with solutions to some of your problems. Key takeaways: - Leveraging kafka and the related ecosystem on Openstack and Azure - Saving Cost at scale with Kafka and related eco-system - Scaling Kafka Streams and Kafka Connector applications. - Tuning Kafka Streams to improve performance. - How to stabilize the kafka connectors operating at scale.
Netflix Studio spent 8 Billion dollars on content in 2018. When the stakes are so high, it is paramount to track changes to the core studio metadata, spend on our content, forecasting and more to enable the business to make efficient and effective decisions. Embracing a Kappa architecture with Kafka enables us to build an enterprise grade message bus. By having event processing be the de-facto paved path for syncing core entities, it provides traceability and data quality verification as first className citizens for every change published.This talk will also get into the nuts and bolts of the eventing and stream processing paradigm and why it is the best fit for our use case, versus alternative architectures with similar benefits We will do a deep dive into the fascinating world of Netflix Studios and how eventing and stream processing are revolutionizing the world of movie productions and the production finance infrastructure.
Cloud migration: it's practically a rite of passage for anyone who's built infrastructure on bare metal. When we migrated our 5-year-old Kafka deployment from the datacenter to GCP, we were faced with the task of making our highly mutable server infrastructure more cloud-friendly. This led to a surprising decision: we chose to run our Kafka cluster on Kubernetes. I'll share war stories from our Kafka migration journey, explain why we chose Kubernetes over arguably simpler options like GCP VMs, and present the lessons we learned while making our way toward a stable and self-healing Kubernetes deployment. I'll also go through some improvements in the more recent Kafka releases that make upgrades crucial for any Kafka deployment on immutable and ephemeral infrastructure. You'll learn what happens when you try to run one complex distributed system on top of another, and come away with some handy tricks for automating cloud cluster management, plus some migration pitfalls to avoid. And if you're not sure whether running Kafka on Kubernetes is right for you, our experiences should provide some extra data points that you can use as you make that decision.
Kafka at ING has a long history. It all started in 2014 when Kafka was introduced to support our use-cases for fraud detection. The following years saw Kafka growing until 2018 when it took the spotlight at ING as the #1 searched-for technology with an unprecedented adoption curve. Suddenly what was a small trickle of niche use-cases, became a flood of customers on-boarding for every imaginable usage pattern. Even more astonishing was that our single cluster configuration saw almost 700% load increase just in 2018. During the explosion our team was so busy with on-boarding, maintenance and ops that it wasn't clear whether we were supporting or sabotaging our long-term success. It was clear that we needed to challenge our view of how we used Kafka. We asked ourselves: given the demand, how can our clients easily manage their streams while not caring about scaling, clusters or technologies. How to provide Kafka to all 40 ING markets with a single experience regardless of country, app or use-case? To answer that we had a to undertake a paradigm shift - from a single-cluster on premise to multi-cluster hybrid cloud, from ops to self-service, from cluster-based to event-first architecture. In this talk we share our journey of how we completely re-imagined Kafka at ING while serving our clients at lightning speed. We would like to discuss our experience of running a single cluster with more than 1000 topics and show how we made Kafka truly self-service via our Streaming Marketplace. Last, but not least, how the past years of success with Kafka have given us the courage to go all-in on the event-first thinking and never think back, putting ING once again one step ahead. Co-presented by Filip Yonov, Product Owner Kafka @ ING
NASA's Deep Space Network (DSN) operates spacecraft communication links for NASA deep-space spacecraft missions, including the Curiosity Rover, the Voyager twin spacecraft, Galileo, New Horizons, etc., and has done so reliably for over fifty years. The DSN Complex Event Processing (DCEP) software assembly is a new software system being deployed worldwide into NASA's DSN Deep Space Communication Complexes (DSCC's), including facilities in Spain, Australia, and the United States. The system brings into the DSN next-generation "Big Data" and "Fast Data" infrastructural tools, including Apache Kafka, for correlating real-time network data with other critical data assets, including predicted antenna pointing parameters and extensive logging of physical hardware in the DSN. The ultimate use case is to ingest, filter, store, and visualize all of the DSN's monitor and control data and to actively ensure the successful DSN tracking, ranging, and communication integrity of dozens of concurrent deep-space missions. The system is also intended to support future autonomy applications, including automated anomaly detection in real-time network monitor streams and automated reconfiguration of antenna related assets as needed by future, increasingly autonomous spacecraft. This talk will focus upon the software system behind DCEP, and introduce novel approaches to increasing NASA spacecraft link-control operator cognizance into anomalies that may and do occur during spacecraft tracking activities. This talk will also offer lessons learned, and provide a glimpse into one of the most unique, "out-of-this-world", applications of Apache Kafka.
Tesla ingests trillions of events every day from hundreds of unique data sources through our streaming data platform. Find out how we developed a set of high-throughput, non-blocking primitives that allow us to transform and ingest data into a variety of data stores with minimal development time. Additionally, we will discuss how these primitives allowed us to completely migrate the streaming platform in just a few months. Finally, we will talk about how we scale team size sub-linearly to data volumes, while continuing to onboard new use cases.
Centene is fundamentally modernizing its legacy monolithic systems to support distributed, real-time event-driven healthcare information processing. A key part of our architecture is the development of a universal eventing framework to accommodate transformation into an event-driven architecture (EDA). Our application provides a representational state transfer (REST) and remote procedure call (gRPC) interface that allows development teams to publish and consume events with a simple Noun-Verb-Object (NVO) syntax. Embedded within the framework are structured schema evolutions with Confluent Schema Registry and AVRO, configurable (self-service) event-routing with K-Tables, dynamic event-aggregation with Kafka Streams, distributed event-tracing with Jaeger, and event querying against a MongoDB event-store hydrated by Kafka Connect. Lastly, we developed techniques to handle long-term event storage within Kafka; specifically surrounding the automated deletion of expired events and re-hydration of missing events. In Centene's first business use case, events related to claim processing of provider reconsiderations was used to provide real-time updates to providers on the status of their claim appeals. To satisfy the business requirement, multiple monolith systems independently leveraged the event framework, to stream status updates for display on the Centene Provider Portal instantly. This provided a capability that was brand new to Centene: the ability to interact and engage with our providers in real-time through the use of event streams. In this presentation, we will walk you through the architecture of the eventing framework and showcase how our business requirements within our claims adjudication domain were able to be solved leveraging the Kafka Stream DSL and the Confluent Platform. And more importantly, how Centene plans on leveraging this framework, written on-top of Kafka Streams, to change our culture from batch processing to real-time stream processing.