Show Me How: Build Streaming Data Pipelines for Real-Time Data Warehousing | Register Today
View sessions and slides from Kafka Summit 2020
Let's begin a very unusual Kafka Summit by reflecting about change. Changes we've seen in the software engineering world and changes we've seen in Kafka. We'll also talk about things that don't change - like great software design and architecture. We'll dive deep into two huge changes that are happening in the Kafka community right now - and the possibilities they open for the future.
In this talk, Confluent co-founder and CEO, Jay Kreps will cover the rise of two trends:
1. The rise of Apache Kafka and event streams
2. The rise of the public cloud and cloud-native data systems ... and the problems we need to solve as these two trends come together.
Data is essential. It’s the lifeblood of our business, and without it we’re lost. The problem with data though, is all too often we can end up working for the data, rather than having the data work for us. So much of the systems we build are about managing data properly. Storing it safely, getting it where it needs to be. Making sure it is held safely, or that it is manipulated in the right way. The nature of the data we manage can end up constraining our applications in a host of ways. Rather than making data the constraint in our system, we need to find ways to better unlock the value it has for our organisations. In this keynote, Sam will look at how to reimagine the use of data to make sure the data works for us, not the other way around.
It’s likely that your organization is doing more in the cloud now than ever before. Chances are that if you’re not already running some or all of your event-streaming use cases in a private, public, or hybrid cloud environment, you will be soon. At this panel, Chirag Dadia, Director of Engineering at Nuuly joins Ravi Vankamamidi, Director of Technology at Expedia Group for a Q&A with Confluent co-founder Jun Rao to share their transformative journeys with event streaming on a fully managed, cloud-native, Kafka-powered platform. In addition to covering the cost-effectiveness of reducing operational burden with Confluent Cloud*, the two will discuss the advantages of more reliable and efficient operations, new opportunities made possible by getting to market faster, and what increased speed-to-market has meant to their businesses and their bottom lines.
Join Confluent co-founder Jun Rao as he talks with Leon Stiehl of Citi about why and how Kafka is at the epicenter of seismic change in the way big banks and financial institutions think about data. Gone are the days of end-of-day batch processes and static data locked in disparate silos. Today’s data is in motion, and Citi is moving with it. Enterprises need to see their business in near real time; they need to respond in milliseconds not hours; and they need to integrate, aggregate, curate, and disseminate data and events within and across production environments. Learn how Citi is tackling these challenges, using event streaming to drive efficiencies and improve customer experiences.
Humana is at the forefront of an industry-wide effort to improve health outcomes while keeping costs in check through improved interoperability. Levi Bailey, Associate Vice President, Cloud Architecture, will share Humana’s vision and how an event-driven architecture, built with Kafka and Confluent, powers the interoperability platforms at the heart of the company’s digital transformation.
In this Q&A keynote, Confluent co-founder Jun Rao and Lowe’s Domain Architect Bhanu Solleti will discuss the central role that Kafka and real-time event streaming play at one of the world’s largest home improvement retailers. At Lowe’s, Kafka and Confluent Platform form the core of an event-driven architecture that connects systems across cloud platforms, data centers, and brick-and-mortar stores. This architecture, along with enterprise-ready solutions like replicators and connectors, are helping Lowe’s improve time-to-market and respond quickly to shifting store hours, increased curbside pickup, and a range of other new business imperatives that have emerged during the pandemic.
Organizations, people, and open-source event streaming platforms all go through defined developmental stages. A look back at some of the content of Kafka Summit 2020 helps us see what to look forward to tomorrow.
Running applications across two data centers is a requirement for many industries. Understanding how to deploy and architect a Kafka Streams application for multiple data centers can seem daunting for both developers and operators. Both stretch clusters and replication present unique challenges. This talk will go over best practices and answer questions such as, should I replicate internal topics? What are the implications of exactly once semantics? Do I need to run active/active or active/passive? How do I minimize recovery time after a failure? We’ll discuss important issues for stretch clusters such as rack/dc placement of internal topic partitions, state store gotchas and common latency vs throughput trade offs. The patterns presented will enable you to confidently design and execute resilient Kafka Streams applications.
Machine Learning (ML) is separated into model training and model inference. ML frameworks typically use a data lake like HDFS or S3 to process historical data and train analytic models. But it’s possible to completely avoid such a data store, using a modern streaming architecture.
This talk compares a modern streaming architecture to traditional batch and big data alternatives and explains benefits like the simplified architecture, the ability of reprocessing events in the same order for training different models, and the possibility to build a scalable, mission-critical ML architecture for real time predictions with muss less headaches and problems.
The talk explains how this can be achieved leveraging Apache Kafka, Tiered Storage and TensorFlow.
Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again! Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with Kafka Connect, which is part of Apache Kafka. ksqlDB is the source-available SQL streaming engine for Apache Kafka and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.
In this talk, we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, Kafka Connect, and ksqlDB.
Gasp as we filter events in real-time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!
We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. This is heavily used at companies such as LinkedIn, Uber, Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per sec from Kafka, builds indexes in real-time and serves 100K+ queries per second while ensuring latency SLA of millisecond to sub second.
In the first implementation, we used the Consumer Group feature to manage the offsets and checkpoints across multiple Kafka Consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic. This was our initial strategy to maintain the SLA at high query workload. But this model posed other challenges - since Kafka maintains offset per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, a failure of a single node in a consumer group meant the entire consumer group was unavailable for query processing. Restarting the failed node needed lot of manual operations to ensure data is consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
While taking inspiration from the Kafka consumer group implementation, we redesigned the real-time consumption in Pinot to maintain consistent offset across multiple consumer groups. This allowed us to guarantee consistent data across all replicas. This enabled us to copy data from another consumer group during node addition, node failure or increasing the replication group.
In this talk, we will deep dive into the various challenges faced and considerations that went into this design, and learn what makes Pinot resilient to failures both in Kafka Brokers and Pinot Components. We will introduce the new concept of "lockstep" sequencing where multiple consumer groups can synchronize checkpoints periodically and maintain consistency. We'll describe how we achieve this while maintaining strict freshness SLAs, and also withstanding high throughput and ingestion.
I want to tell a story about heavily leveraging Kafka Streams and Kafka Connect to reduce the end latency to minutes, at the same time making the pipeline easier and cheaper to run. We were able to successfully validate the new data pipeline by launching two massive games just 4 weeks apart.
After migrating much of our tech stack to Kafka Streams and Kafka Connect, some of our newer streaming applications encountered the same problem: while their primary input data was available in Kafka and their static side input data could be ingested using traditional connectors, they still needed access to semi-static data that has historically been acquired through in-line synchronous requests and responses. External service calls made from within a Kafka Streams application are highly discouraged, as they make error-handling difficult and typically slow down the application. To solve this problem, we used Kafka Connect components to build a large-scale data subscription service which subscribes to, polls, and disseminates auxiliary data to Kafka topics for immediate consumption by our streaming applications.
This talk will cover:
• An overview of our subscription service architecture
• A deep-dive into our Kafka Connector and each of its custom components
• Design considerations and various pitfalls that we encountered along the way
Since Pac-Man was originally released in the '80s, it has been a beacon of fun and joy for people of all ages. What few people know is that this game can also be used to inspire developers on how to build event streaming applications. In this near-zero-slides talk, attendees will get to play the game to generate events. As they play, the presenter will write from scratch a scoreboard using ksqlDB -- an open-source event streaming database built for Apache Kafka.
After building the scoreboard, it will be discussed the different strategies to make the data available elsewhere so any interested service could leverage it with ease. Examples of these services will be provided to monitor in near real-time the scoreboard, revealing whoever is the most proficient Pac-Man player in the room.
Event Modeling is a fairly new information system modeling discipline created by Adam Dymitruk that is heavily influenced by CQRS and Event Sourcing. Its lineage follows from Event Storming, Design Thinking, and other modeling practices from the Agile and Domain-Driven Design communities. The methodology emphasizes simplicity (there are only four model ingredients) and inclusion of non-developer participants.
Like other modeling disciplines, Event Modeling is sufficiently general to enable collaborative learning and knowledge exchange among UI/UX designers, software engineers and architects, and business domain experts. But it's also sufficiently expressive and specific to be directly actionable by the implementors of the information system described by the model. During this talk, we'll:
• Build an Event Model of a simple information system, including wire-framing the UI/UX experience
• Explore how to proceed from model to implementation using Kafka, its Streams and Connect APIs, and KSQL
• Jump-start the implementation by generating code directly from the Event Model
• Track and measure the work of implementation by generating tasks directly from the Event Model
What does a Kafka administrator need to do if they have a user who demands that message delivery be guaranteed, fast, and low cost? In this talk we walk through the architecture we created to deliver for such users. Learn around the alternatives we considered and the pros and cons around what we came up with.
In this talk, we’ll be forced to dive into broker restart and failure scenarios and things we need to do to prevent leader elections from slowing down incoming requests. We’ll need to take care of the consumers as well to ensure that they don’t process the same request twice. We also plan to describe our architecture by showing a demo of simulated requests being produced into Kafka clusters and consumers processing them in lieu of us aggressively causing failures on the Kafka clusters.
We hope the audience walks away with a deeper understanding of what it takes to build robust Kafka clients and how to tune them to accomplish stringent delivery guarantees.
Streams Change data capture (CDC) via Debezium is liberation for your data: By capturing changes from the log files of the database, it enables a wide range of use cases such as reliable microservices data exchange, the creation of audit logs, invalidating caches and much more.
In this talk we're taking CDC to the next level by exploring the benefits of integrating Debezium with streaming queries via Kafka Streams. Come and join us to learn:
• How to run low-latency, time-windowed queries on your operational data
• How to enrich audit logs with application-provided metadata
• How to materialize aggregate views based on multiple change data streams, ensuring transactional boundaries of the source database
We'll also show how to leverage the Quarkus stack for running your Kafka Streams applications on the JVM, as well as natively via GraalVM, many goodies included, such as its live coding feature for instant feedback during development, health checks, metrics and more.
What do you do when you've two different technologies on the upstream and the downstream that are both rapidly being adopted industrywide? How do you bridge them scalably and robustly? At Wework, the upstream data was being brokered by Kafka and the downstream consumers were highly scalable gRPC services. While Kafka was capable of efficiently channeling incoming events in near real-time from a variety of sensors that were used in select Wework spaces, the downstream gRPC services that were user-facing were exceptionally good at serving requests in a concurrent and robust manner. This was a formidable combination, if only there was a way to effectively bridge these two in an optimized way. Luckily, sink Connectors came to the rescue. However, there weren't any for gRPC sinks! So we wrote one.
In this talk, we will briefly focus on the advantages of using Connectors, creating new Connectors, and specifically spend time on gRPC sink Connector and its impact on Wework's data pipeline.
Have you ever wished that KTable joins worked like SQL joins? Well, now they do! Foreign-key, many:one, joins were added to Apache Kafka in 2.4. This talk is a deep dive into the surprisingly complex implementation required to compute these joins correctly. Building on that understanding, we'll discuss how you can expect Streams to behave when you use the feature, including how to test it, and thoughts on optimization. Finally, we will take Bazaarvoice as a case study. They are in process on migrating from their high-scale in-house stream processing platform to one based on Apache Kafka and Kafka Streams. I'll share the way that they implemented foreign-key joins on Kafka 2.3, and how much simpler it is with native support. Plus, we will also share key operational insights from their experience.
The Oak Ridge Leadership Facility (OLCF) in the National Center for Computational Sciences (NCCS) division at Oak Ridge National Laboratory (ORNL) houses world-class high-performance computing (HPC) resources and has a history of operating top-ranked supercomputers on the TOP500 list, including the world's current fastest, Summit, an IBM AC922 machine with a peak of 200 petaFLOPS. With the exascale era rapidly approaching, the need for a robust and scalable big data platform for operations data is more important than ever. In the past when a new HPC resource was added to the facility, pipelines from data sources spanned multiple data sinks which oftentimes resulted in data silos, slow operational data onboarding, and non-scalable data pipelines for batch processing. Using Apache Kafka as the message bus of the division's new big data platform has allowed for easier decoupling of scalable data pipelines, faster data onboarding, and stream processing with the goal to continuously improve insight into the HPC resources and their supporting systems. This talk will focus on the NCCS division's transition to Apache Kafka over the past few years to enhance the OLCF's current capabilities and prepare for Frontier, OLCF's future exascale system; including the development and deployment of a full big data platform in a Kubernetes environment from both a technical and cultural shift perspective. This talk will also cover the mission of the OLCF, the operational data insights related to high-performance computing that the organization strives for, and several use-cases that exist in production today.
This talk is aimed to give developers who are interested to scale their streaming application with Exactly-Once (EOS) guarantees. Since the original release, EOS processing has received wide adoption as a much needed feature inside the community, and has also exposed various scalability and usability issues when applied in production systems.
To address those issues, we improved on the existing EOS model by integrating static Producer transaction semantics with dynamic Consumer group semantics. We will have a deep-dive into the newly added features (KIP-447), from which the audience will have more insight into the scalability v.s. semantics guarantees tradeoffs and how Kafka Streams specifically leveraged them to help scale EOS streaming applications written in this library. We would also present how the EOS code can be simplified with plain Producer and Consumer. Come to learn more if you wish to adopt this improved EOS feature and get started on building your own EOS application today!
SIEM platforms are essential to the new cybersecurity paradigm and data collection layer is a very important piece of it.
When you deliver a new platform, you can easily get lost in a variety of different vendors and solutions, too many challenges are facing. What if I change vendors, will I keep my data? How to feed multiple tools with the same data? How to collect data from custom apps and services? How to pay less for an expensive platform? How to keep data without a huge cost?
Join us if you are looking for the answers.
In this session, you will learn how we replaced the vendor-provided data collection layer with kafka connect and the lessons we learnt. After the talk you will know:
• architecture and real-life examples of the flexible and highly available data collection platform
• custom connectors that do most of the work for us and how to extend the connectors to consume new data, we made them open sourced
• easy way to receive data from thousands of servers and many cloud services
• how to archive data at low cost
You will leave armed with a set of free tools and recipes to build a truly vendor-agnostic data collection platform. It will allow you to take you SIEM costs under control. You will feed your analytics tools with what they need and archive the rest at low cost. You will feed your SIEM smart!
Responding to a global pandemic presents a unique set of technical and public health challenges. The real challenge is the ability to gather data coming in via many data streams in variety of formats influences the real-world outcome and impacts everyone. The Centers for Disease Control and Prevention CELR (COVID Electronic Lab Reporting) program was established to rapidly aggregate, validate, transform, and distribute laboratory testing data submitted by public health departments and other partners. Confluent Kafka with KStreams and Connect play a critical role in program objectives to:
• Track the threat of COVID-19 virus
• Provide comprehensive data for local, state, and federal response
• Better understand locations with an increase in incidence
Running a multi-tenant Kafka platform designed for the enterprise can be challenging. You need to manage and plan for data growth, support an ever-increasing number of use cases, and ensure your developers can be productive with the latest tools in the Apache Kafka ecosystem — all while maintaining the stability and performance of Kafka itself.
At Bloomberg, we run a fully-managed, multi-tenant Kafka platform that is used by developers across the enterprise. The variety of use cases for Kafka leads to bursty workloads, latency-sensitive workloads, and topologies where partitions are fanned out across hundreds or thousands of consumer groups running side-by-side in the same cluster.
In this talk, we will give a brief overview of our platform and share some of our experiences and tools for running multi-tenant stretched clusters, managing data growth with compression, and mitigating the impact of various application patterns on shared clusters.
By nature Event-driven systems transform data and propagate it across multiple services. This characteristic makes the GDPR compliance challenging. Immutable Kafka logs make it impossible to explicitly delete a published message that may contain Personally Identifiable Information (PII). A general solution has been to choose a short-enough retention duration for such topics so that the data is eventually removed within the allowed time limit. As for consumers of the data, one typically has to audit and trace where the data is propagated, and request each of the consuming services to purge their copy. Even then PII may still continue to exist, for example in backups, intermediate stating environments like S3 buckets, and ad-hoc copies of the data used for business analytics, data science, etc.
This talk presents a way to build GDPR compliance into the message propagation protocol itself, and utilise crypto-shredding to in effect render all copies of PII decipherable on demand. The talk explains how a message schema such as ProtocolBuffer can be extended to allow publishers of data to mark data as PII. It shows how GDPR compliance can be integrated into existing APIs that were not designed with GDPR in mind, with minimum disruption. It illustrates how the marked data is encrypted before it is stored in Kafka and guarantee that the data remains encrypted throughout its entire propagation journey. The talk shows how the key management system works transparently across thousands of services to control access to data with different granularity and protect against cross referencing to avoid unauthorized access to data.
Contributing to an open source project can be a very rewarding experience especially when you realize how your contributions are transforming the project and impacting others in a positive way. However, as a newcomer to an open source project with several moving parts and a very rich ecosystem, getting started and getting involved as a contributor can be overwhelming sometimes. You might start to wonder: How does the project operate? What and how can I contribute? Where do I get even started?
In this session, we will cover how to go from zero to contributing in a very short time. We will also cover the benefits of contributing to an open source project such as Apache Kafka, why you should get involved, how you can contribute, and the various areas you can contribute to the project. Contributing typically starts with answering questions on the mailing list or StackOverflow, reporting bugs, or helping to maintain the documentation. Then, it goes all the way to submitting patches, participating in design discussions, and testing releases to name just a few.
Regardless of whether you are a newcomer or already contributing in some form to the project, you will find this session very valuable and a great way to jumpstart and/or accelerate your participation in the Apache Kafka project. Apache Kafka has a very open and friendly community and we are looking forward to meeting you on the mailing list!
More and more Enterprises are relying on Apache Kafka to run their businesses. Cluster administrators need the ability to mirror data between clusters to provide high availability and disaster recovery.
MirrorMaker 2, released recently as part of Kafka 2.4.0, allows you to mirror multiple clusters and create many replication topologies. Learn all about this awesome new tool and how to reliably and easily mirror clusters.
We will first describe how MirrorMaker 2 works, including how it addresses all the shortcomings of MirrorMaker 1. We will also cover how to decide between its many deployment modes. Finally, we will share our experience running it in production as well as our tips and tricks to get a smooth ride.
While Apache Kafka is designed to be fault-tolerant, there will be times when your Kafka environment just isn’t working as expected.
Whether it’s a newly configured application not processing messages, or an outage in a high-load, mission-critical production environment, it’s crucial to get up and running as quickly and safely as possible.
IBM has hosted production Kafka environments for several years and has in-depth knowledge of how to diagnose and resolve problems rapidly and accurately to ensure minimal impact to end users.
This session will discuss our experiences of how to most effectively collect and understand Kafka diagnostics. We’ll talk through using these diagnostics to work out what’s gone wrong, and how to recover from a system outage. Using this new-found knowledge, you will be equipped to handle any problem your cluster throws at you.
Last year, U.S. Citizenship and Immigration Services (USCIS) adopted a new strategy to accelerate our transition to a digital business model. This eProcessing strategy connects previously siloed technology systems to provide a complete digital experience that will shorten decision timelines, increase transparency, and more efficiently handle the 8 million requests for immigration benefits the agency receives each year.
To pursue this strategy effectively, we had to rethink and overhaul our IT landscape, one that has much in common with those other large enterprises in both the public and private sectors. We had to move away from antiquated ETL processes and overnight batch processing. And we needed to move away from the jumble of ESB, message queues, and spaghetti-stringed direct connections that were used for interservice communication.
Today, eProcessing is powered by real-time event streaming with Apache Kafka and Confluent Platform. We are building out our data mesh with microservices, CDC, and an event-driven architecture. This common core platform has reduced the cognitive load on development teams, who can now spend more time on delivering quality code and new features, less on DevSecOps and infrastructure activities. As teams have started to align around this platform, a culture of reusability has grown. We’ve seen a reduction in duplication of effort -- in some cases by up to 50% -- across the organization from case management to risk and fraud.
Join us at this session where we will share how we:
• Used skunkworks projects early on to build internal knowledge and set the stage for the eProcessing form factory that would drive the digital transition at USCIS
• Aggregated disparate systems around a common event-streaming platform that enables greater control without stifling innovation
• Ensured compliance with FIPS 140-2 and other security standards that we are bound by
• Developed working agreements that clearly defined the type of data a topic would contain, including any personally identifiable information requiring additional measures
• Simplified onboarding and restricted jumpbox access with Jenkins jobs that can be used to create topics in dev and other environments
• Implemented distributed tracing across all topics to track payloads throughout our entire domain structure
• Started using KSQL to build streaming apps that extract relevant data from topics among other use cases
• Supported grassroots efforts to increase use of the platform and foster cross-team communities that collaborate to increase reuse and minimize duplicated effort
• Established a roadmap for federation with other agencies, that includes replacing SOAP, SFTP, and other outdata data-sharing approaches with Kafka event streaming
Apache Kafka users who want to leverage Google Cloud Platform's (GCPs) data analytics platform and open source hosting capabilities can bridge their existing Kafka infrastructure on-premise or in other clouds to GCP using Confluent's replicator tool and managed Kafka service on GCP. Using actual customer examples and a reference architecture, we'll showcase how existing Kafka users can stream data to GCP and use it in popular tools like Apache Beam on Dataflow, BigQuery, Google Cloud Storage (GCS), Spark on Dataproc, and Tensorflow for data warehousing, data processing, data storage, and advanced analytics using AI and ML.
Testing stream processing applications (Kafka Streams and ksqlDB) isn’t always straightforward. You could run a simple topology manually and observe the results. But how about repeatable tests that you can run anytime, as part of a build without a Kafka cluster or Zookeeper? Luckily, Kafka Streams includes the TopologyTestDriver module (and ksqlDB includes test-runner) that allows you to do precisely that. After learning this, no doubt, your test coverage is sky-high! However, how will your stream processing application perform once deployed to production? You might depend on external resources such as databases, web services, and connectors. Viktor will start this talk covering the basics of unit testing of Kafka Streams applications using TopologyTestDriver. Viktor will also look at some popular open-source libraries for testing streams applications. Viktor demonstrates TestContainers, a Java library that provides lightweight, disposable instances of shared databases, Kafka clusters, and anything else that can run in a Docker container and how to use it for integration testing of processing applications! And lastly, Viktor will show ksqlDB’s test-runner to unit test your KSQL applications.
For quite some time, I had a fuzzy feeling that I didn’t really understand event streaming architectures and how they fit more broadly into the modern software architecture puzzle. Then I saw a concrete, real-life example from an airplane maintenance use case, where billions of sensor data points come in via Kafka and must be transformed into insights that occasionally lead to important actions a mechanic needs to take.
This story led to a personal revelation: Data-streams are passive in nature. On their own, they do not lead to any action. But at some point in time, actions must be taken. The action might be carried out by a human looking at data and reacting to it, or an external service that’s called, or a "traditional" database that’s updated, or a workflow that’s started. If there’s never any action, your stream is kind of useless.
Now, the transition from a passive stream to an active component reacting to an event in the stream is very interesting. It raises a lot of questions about idempotency, scalability, and the capability to replay streams with changed logic. For example, in the project mentioned above, we developed our own stateful connector that starts a workflow for a mechanic only once for every new insight, but can also inform that workflow if the problem no longer exists. Replaying streams with historic data did not lead to any new workflows created.
In this talk, I’ll walk you through the aircraft maintenance case study in as much detail as I can share, along with my personal discovery process, which I hope might guide you on your own streaming adventures.
Who is the best?
To answer this question once and for all we created a tool with which we can track scores, rankings and statistics. Why you might ask, it is simple, playing kicker, foosball or töggele is in our DNA. So is creating awesome tools with awesome components.
Kafka, especially integrated in Confluent Cloud, is one such component. Join us when we explain how we built a small utility to track games that we completely overengineered by using Kafka, KSQL, Quarkus and other fancy components. No matter what we tell others, we did it for fun. However, there are some real learnings we want to share that will help with building real-world streaming applications.
For a long time we discuss how much data we can keep in Kafka. Can we store data forever or do we remove data after a while and maybe having the history in a data lake on Object Storage or HDFS? With the advent of Tiered Storage in Confluent Enterprise Platform, storing data much longer in Kafka is much very feasible. So can we replace a traditional data lake with just Kafka? Maybe at least for the raw data? But what about accessing the data, for example using SQL?
KSQL allows for processing data in a streaming fashion using an SQL like dialect. But what about reading all data of a topic? You can reset the offset and still use KSQL. But there is another family of products, so-called query engines for Big Data. They originate from the idea of reading Big Data sources such as HDFS, object storage or HBase, using the SQL language. Presto, Apache Drill and Dremio are the most popular solutions in that space. Lately these query engines also added support for Kafka topics as a source of data. With that you can read a topic as a table and join it with information available in other data sources. The idea of course is not real-time streaming analytics but batch analytics directly on the Kafka topic, without having to store it in a big data storage.
This talk answers, how well these tools support Kafka as a data source. What serialization formats do they support? Is there some form of predicate push-down supported or do we have to always read the complete topic? How performant is a query against a topic, compared to a query against the same data sitting in HDFS or an object store? And finally, will this allow us to replace our data lake or at least part of it by Apache Kafka?
One of the key metrics to monitor when working with Apache Kafka, as a data pipeline or a streaming platform, is Consumer Groups Lag.
Lag is the delta between the last produced message and the last committed message of a partition. In other words, lag indicates how far behind your application is in processing up-to-date information.
For a long time, we used our own service to keep track of these metrics, collect them and visualize them. But this didn’t scale well.
You had to perform many manual operations, redeploy it and to do other tedious manual tasks, but most importantly, the biggest gap for us, was that its output was represented in absolute numbers (e.g - your lag is 30K), which basically tells you nothing as a human being.
We understood that we had to find a more suitable solution that will give us better visibility and will allow us to measure the lag in a time-based format that we all understand.
In this talk, I’m going to go over the core concepts of Kafka offsets and lags, and explain why lag even matters and is an important KPI to measure. I’ll also talk about the kind of research we did to find the right tool, what the options in the market were at the time, and eventually why we chose Linkedin’s Burrow as the right tool for us. And finally, I’ll take a closer look at Burrow, its building blocks, how we build and deploy it, how we monitor better with it, and eventually the most important improvement - how we transformed its output from numbers to time-based metrics.
PayPal currently processes tens of billions of signals per day from different sources in batch and streaming mode. The data processing platform is the one powering these different analytical needs and use cases, not just at PayPal but our adjacencies like Venmo, Hyperwallet and iZettle. End users of this platform demand access to data insights with as much flexibility as possible to explore it with low processing latency.
One such use case is where our Switchboard(data de-multiplexer) platform where we process approximately 20 billion events daily and provide data to different teams and platforms with-in PayPal and also to platform outside PayPal for more insights. When we started building this platform Kafka was just another asynchronous message processing platform for us but we have seen it evolving to a place where its adds value not just in terms of event processing but also for platform resiliency and scalability.
Takeaway for the audience: Most people work with and have knowledge about data. With this talk I want to present information which is relevant and meaningful to the audience. Information and examples which will make it easier for attendees to understand our complex system and hopefully have some practical takeaways to use Kafka for similar problems on their hand.
When working with KafkaConsumer, we usually employ single thread both for reading and processing of messages. KafkaConsumer is not thread-safe, so using single thread fits in well. Downside of this approach is that you are limited to single thread for processing messages.
By decoupling consumption and processing, we can achieve processing parallelization with single consumer and get the most out of multi-core CPU architectures available today. While this can be very useful in certain use-case scenarios, it's not trivial to implement.
How do we use multiple threads with KafkaConsumer which is not thread safe? How do we react to consumer group rebalancing? Can we get desired processing and ordering guarantees? In this talk we 'll try to answer these questions and explore challenges we face on our path.
Since its release in 2018, KSQL has grown from interesting curiosity into ksqlDB - a production grade streaming system. What does it look like to run KSQL in the enterprise? How has the promise of the Kafka Streams with an SQL dialect worked in the wild?
Let's explore stream processing with ksqlDB in the enterprise. How is it used to rapid prototyping; for taking an idea to production. Using the flexible scripting to help teams with error discover and system introspection. Plus how extended teams can use KSQL as a stepping stone for building and sharing real-time scoring and streaming insights.
This session will cover production deployments of ksqlDB in banking, finance, transport and insurance. What can go wrong, and what can go right. See how teams embrace the technology to solve stream processing challenges.
If your data platform is powered only by batch data processing, you know you are always trailing your customer. Your databases aren’t always up to date. Your inability to have a synchronized data flow across systems leads to operational inefficiencies. And, your dreams of running advanced real-time AI and ML applications can’t be fulfilled. However, you might be wary of the implications of turning your product into an event-driven one. In this presentation we’ll share our experience transforming our CDP-based marketing orchestration engine to be both real-time and highly scalable with the Kafka ecosystem. We will look into how we saved resources with Connect when ingesting and syncing data with NoSQL databases, data warehouses and third-party platforms. What we did to turn ksqlDB into our data transformation, aggregation and querying hub, reducing latency and costs. How Streams helps us activate multiple real-time applications such as building identity graphs, updating materialized views in high frequency for efficient real-time lookups and inferencing machine learning models. Finally, we will look at how Confluent Cloud solved our pre-rollout sizing and scaling questions, significantly reducing time-to-market.
Today, many companies that have lots of data are still struggling to derive value from machine learning (ML) and data science investments. Why? Accessing the data may be difficult. Or maybe it’s poorly labeled. Or vital context is missing. Or there are questions around data integrity. Or standing up an ML service can be cumbersome and complex.
At Nuuly, we offer an innovative clothing rental subscription model and are continually evolving our ML solutions to gain insight into the behaviors of our unique customer base as well as provide personalized services. In this session, I’ll share how we used event streaming with Apache Kafka® and Confluent Cloud to address many of the challenges that may be keeping your organization from maximizing the business value of machine learning and data science. First, you’ll see how we ensure that every customer interaction and its business context is collected. Next, I’ll explain how we can replay entire interaction histories using Kafka as a transport layer as well as a persistence layer and a business application processing layer. Order management, inventory management, logistics, subscription management – all of it integrates with Kafka as the common backbone. These data streams enable Nuuly to rapidly prototype and deploy dynamic ML models to support various domains, including pricing, recommendations, product similarity, and warehouse optimization. Join us and learn how Kafka can help improve machine learning and data science initiatives that may not be delivered to their full potential.
As your company moves through their digital transformation using Apache Kafka, how are they measuring their progress? Are they still using pre-digital data warehouse and data lake technology to answer the day-to-day operational questions? A modern company needs a modern real-time analytics platform to uncover the insights required to effectively understand how their digital transformation is affecting their business. Apache Kafka – with help from other real-time analytics technologies like Apache Druid and Apache Superset – can provide the tools to make this happen.
If you’re considering -- or planning -- a cloud migration, you may be concerned about risks to your data and your mental health. Migrations at scale are fraught with risk. You absolutely can’t lose data, compromise its integrity, or suffer downtime, so you want to be slow and careful. On the other hand, you’re paying two providers for every day the migration goes on, so you need to move as fast as possible.
Unity Technologies accumulates lots of data. We recently moved our data infrastructure as part of a major cloud migration from Amazon Web Services (AWS) to Google Cloud Platform (GCP).
To minimize risk and costs our team used Apache Kafka and Confluent Platform, while engaging Confluent Platform Professional Services to help ensure a speedy and seamless migration. Kafka was already serving as the backbone to our data infrastructure, which handles over half a million events per second, and during the migration it also served as the bridge between AWS and GCP.
Join us at this session to learn about the processes and tools used, the challenges faced, and the lessons learned as we moved our operations and petabytes of data from AWS to GCP with zero downtime.
This is the beautiful story of MQTT and Apache Kafka with the villages of Tanzania. 2G in Africa is not going anywhere any time soon. We serve microfinance banks with limited server resources in remote areas. These banks are beloved in the villages of Africa because they provide cheap loans. In Tanzania and Africa by extension, Agency Banking is the best innovation that ever happened to address financial inclusion. In this model, commercial banks contract third party retail networks as bank agents across the villages. Upon successful application, vetting and approval, these agents are authorized to offer select banking services on behalf of the banks. Bank agents are issued with POS Terminals which have the agency banking application installed. Uchumi Commercial Bank has over 350+ agents all over Tanzania. The bank initially went live with the POS Terminals connecting to the bank servers (bare metal) using REST API. However, many issues emerged. Poor internet connectivity, poor electricity connectivity(agents had to travel far to charge their POS Terminals), limited server capacity that just could not allow scalability, expensive internet bundles. Fadhili Juma will showcase and give a deep analysis on how MQTT and Apache Kafka was the game changer in this resource constrained environment.
Even though Kafka is scalable by design, proper handling of over one petabyte of data a day requires much more than Kafka’s scalability. Several challenges present themselves in a data centric business at this scale. These challenges include capacity planning, provisioning, message auditing, monitoring and alerting, rebalancing workloads with changes in traffic patterns, data lineage, handling service degradation and system outages, optimizing cost, upgrades, etc. In this talk we describe how at Pinterest we tackle some of these challenges and share some of the key lessons that we learned in the process. Specifically we will share how we:
• Automate Kafka cluster maintenance
• Manage over 150K partitions
• Manage upgrade lifecycle
• Track / troubleshoot thousands of data pipelines
Secrets are indisputably the biggest risk area in the authentication arena and Apache Kafka is no exception. Kafka services are typically configured using properties files which contain plain text secret configurations, upon startup these configurations are transmitted in clear text to different components, stored in filesystem, internal topics and logs thus creating a secret sprawl.
This talk will deep dive into how we can eliminate this secret sprawl by adding Config Providers to integrate with centralized management systems such as Vault, Keywhiz, or AWS Secrets Manager.
• Security implications of clear text secrets and secret sprawl
• Insecure parsing of secrets configurations in Kafka
• Know how about Kafka Config Providers
• Centralized Management Systems
• How to secure Kafka with CP and CMS
• Trust but Verify ~ Demo
The key takeaway from the session is understanding the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
Recently Protobuf support was added to schema registry. Being the creator of a Rust crate to use schema registry from Rust, it made a lot of sense to also add support for Protobuf. Especially since the Protobuf libraries for Rust are more mature than the Avro ones. The talk will also contain some general information about choosing between Avro and Protobuf.
Developers are quickly moving to having Apache Kafka and events at the heart of their architecture. But how do you make sure your applications are resilient to the fluctuating load that comes with a never-ending stream of events? The Reactive Manifesto provides a good starting point for these kinds of problems. In this session explore how Kafka and reactive application architecture can be combined to better handle our modern event-streaming needs. We will explain why reactive applications are a great fit for Kafka and show an example of how to write a reactive producer and consumer.
Are you a Kafka Streams application developer who needs a faster, more efficient way to reproduce a bug or issue from past events? Do you need to test a new algorithm patch near a discrete point-in-time? Do you have a standard methodology for investigating these issues, or does each team member devise an ad hoc solution? How much time does your simulation take?
It is technically impossible to solve this challenge using only changelog topics, since a compact, topic-based changelog doesn't capture the full change history. The Derivatives Data team at Bloomberg augments this through periodic snapshots. To make the snapshotted state accessible to our replay system, we built a query service that leverages Interactive Queries using the Kafka Streams API, along with gRPC-based coordination, to fetch the distributed snapshot states from different snapshotting instances.
In this talk, we will cover:
• Overview of this system architecture
• Deep dive into the mechanics of snapshots with Kafka Streams, state-store changelogs, and a query service to serve replay requests
• How two modes, normal and replay, are used on the Kafka Streams application runtime
• Some use cases that benefit from this replay system
Every 2 seconds, another person becomes a victim of identity theft. The number of online account takeovers is constantly increasing. In this talk we'll show how stream processing was used to combat this for Tesco, one of Europe's largest retailers. The massive scale of e-commerce makes it an interesting target for malicious users. We implemented a risk-management platform built around Kafka and the Confluent Platform to detect and prevent attacks, including those that come through the website's authentication page. We'll present how this project evolved over 2 years to its current state in production, together with some of the challenges we encountered on the way. As the project has had a couple of phases, we will see and compare alternative designs, summarize their pros and cons, and refer them to well known techniques - like Event Sourcing. We'll discuss the architecture and integration with external systems, before moving onto a detailed examination of the stream processors implementation and key internals such as co-partitioning of data. We'll also cover the role of stack components that we used, including Kafka Connect and Schema Registry, as well as the deployment platform, Kubernetes. Over the course of the talk we will put special emphasis on highlighting key factors to take into consideration when designing data pipelines and stream processing platform.
Kafka is one of the most important foundation services at Zendesk. It became even more crucial with the introduction of Global Event Bus which my team built to propagate events between Kafka clusters hosted at different parts of the world and between different products. As part of its rollout, we had to add mTLS support in all of our Kafka Clusters (we have quite a few of them), this was to make propagation of events between clusters hosted at different parts of the world secure. It was quite a journey, but we eventually built a solution that is working well for us.
Things I will be sharing as part of the talk:
1. Establishing the use case/problem we were trying to solve (why we needed mTLS)
2. Building a Certificate Authority with open source tools (with self-signed Root CA)
3. Building helper components to generate certificates automatically and regenerate them before they expire (helps using a shorter TTL (Time To Live) which is good security practice) for both Kafka Clients and Brokers
4. Hot reloading regenerated certificates on Kafka brokers without downtime
5. What we built to rotate the self-signed root CA without downtime as well across the board
6. Monitoring and alerts on TTL of certificates
7. Performance impact of using TLS (along with why TLS affects kafka’s performance)
8. What we are doing to drive adoption of mTLS for existing Kafka clients using PLAINTEXT protocol by making onboarding easier
9. How this will become a base for other features we want, eg ACL, Rate Limiting (by using the principal from the TLS certificate as Identity of clients)
Takeaways I can think of now includes:
1. While building this I had the expectation that I will find lots of information about mTLS in Kafka on public internet, but to my surprise there isn’t much information online in this area (there’s blogs on how to configure Kafka with TLS, but not on how to build the Public Key Infrastructure supporting it), so I think this will inspire others to try and implement mTLS in their Kafka clusters
2. This will hopefully convince the audience, it’s not that hard to build a PKI infrastructure from scratch using open source tools and add mTLS support in Kafka clusters, which I think lots of people don’t try/give much thought
3. For those who will be interested in implementing something similar, will have a pretty good overview of the entire solution, so should be relatively easy for others to just follow what we did to build theirs
4. Automated certificate rotation is considered hard (Self-Signed Root CA certificate rotation is something even harder and people mostly ignore), so most usually either resort to using longer TTLs, this will hopefully show a way of automated regeneration of certificate both for brokers and clients without service disruption. So this will encourage a good security practice. Will also show a semi-automated way of rotating Root CA Certs without downtime.
5. I think the way we are approaching the wider adoption of mTLS for existing clients using PLAINTEXT is a good way of doing it. We have tried to make it as quick and easy as possible for product engineers to implement it by providing out of the box tools (as part of the foundation team, us doing the heavy lifting for product engineers means they can spend more time on building features for customers while still following best practices and staying secure). This might inspire others. Because in many cases, tools/solutions lack adoptions despite them adding value.
Deploying Kafka to support multiple teams or even an entire company has many benefits. It reduces operational costs, simplifies onboarding of new applications as your adoption grows, and consolidates all your data in one place. However, this makes applications sharing the cluster vulnerable to any one or few of them taking all cluster resources. The combined cluster load also becomes less predictable, increasing the risk of overloading the cluster and data unavailability.
In this talk, we will describe how to use quota framework in Apache Kafka to ensure that a misconfigured client or unexpected increase in client load does not monopolize broker resources. You will get a deeper understanding of bandwidth and request quotas, how they get enforced, and gain intuition for setting the limits for your use-cases.
While quotas limit individual applications, there must be enough cluster capacity to support the combined application load. Onboarding new applications or scaling the usage of existing applications may require manual quota adjustments and upfront capacity planning to ensure high availability.
We will describe the steps we took toward solving this problem in Confluent Cloud, where we must immediately support unpredictable load with high availability. We implemented a custom broker quota plugin (KIP-257) to replace static per broker quota allocation with dynamic and self-tuning quotas based on the available capacity (which we also detect dynamically). By learning our journey, you will have more insights into the relevant problems and techniques to address them.
The PARADIM Materials Innovation Platform has launched a revolutionary, streaming-data-based approach to the design and creation of new interface materials – new materials with the unprecedented properties needed to realize exciting technologies like quantum computing. PARADIM is a state-of-the-art scientific facility at Cornell, Johns Hopkins, and Clark Atlanta Universities that produces a firehose of complex and highly varied data. To build our data layer we’ve implemented Apache Kafka with newly developed, end-to-end encryption to connect our instruments, users, and world-class experts. Data reduction, analysis and management connect seamlessly to real-time ML tools in an integrated materials design workflow. At PARADIM, this flexible, scalable, and fault-tolerant data streaming is creating a new paradigm for collaborative science in the age of materials Big Data.
You have built an event-driven system leveraging Apache Kafka. Now you face the challenge of integrating traditional synchronous request-response capabilities, such as user interaction, through an HTTP web service.
There are various techniques, each with advantages and disadvantages. This talk discusses multiple options on how to do a request-response over Kafka — showcasing producers and consumers using single and multiple topics, and more advanced considerations using the interactive queries of ksqlDB and Kafka Streams. Advanced considerations discussed:
• What a consumer rebalance means to your active request-responses.
• Discuss options for blocking for the async response in the web-service.
• How can the CQRS (Command Query Responsibility Segregation) be leveraged with the interactive state stores of Kafka Streams and ksqlDB?
• Interactive queries of the ksqlDB and Kafka Streams state stores are not available during a rebalance. What is the active Kafka development happening that will make interactive queries a more feasible option?
• Would a custom state store help with rebalancing limitations?
• Can custom partitioning be used for proper routing, and what impacts could that have to the other services in your ecosystem?
• We will explore the above considerations with an interactive quiz application built using Apache Kafka, Kafka Streams, and ksqlDB. With a proper implementation in place, your request-response application can scale and be performant along with handling all of the requests.
How does Kafka Streams and ksqlDB reason about time, how does it affect my application, and how do I take advantage of it? In this talk, we explore the "time engine" of Kafka Streams and ksqlDB and answer important questions how you can work with time. What is the difference between sliding, time, and session windows and how do they relate to time? What timestamps are computed for result records? What temporal semantics are offered in joins? And why does the suppress() operator not emit data? Besides answering those questions, we will share tips and tricks how you can "bend" time to your needs and when mixing event-time and processing-time semantics makes sense. Six month ago, the question "What's the time? …and Why?" was asked and partly answered at Kafka Summit in San Francisco, focusing on writing data, data storage and retention, as well as consuming data. In this talk, we continue our journey and delve into data stream processing with Kafka Streams and ksqlDB, that both offer rich time semantics. At the end of the talk, you will be well prepared to process past, present, and future data with Kafka Streams and ksqlDB.
When choosing an event streaming platform, Kafka shouldn’t be the only technology you look at. There are a plethora of others in the messaging space today, including open source and proprietary software as well as a range of cloud services. So how do you know you are choosing the right one? A great way to deepen our understanding of event streaming and Kafka is exploring the trade-offs in distributed system design and learning about the choices made by the Kafka project. We’ll look at how Kafka stacks up against other technologies in the space, including traditional messaging systems like Apache ActiveMQ and RabbitMQ as well as more contemporary ones, such as BookKeeper derivatives like Apache Pulsar or Pravega. This talk focuses on the technical details such as difference in messaging models, how data is stored locally as well as across machines in a cluster, when (not) to add tiers to your system, and more. By the end of the talk, you should have a good high-level understanding of how these systems compare and which you should choose for different types of use cases.
Apache Kafka sits at the center of a technology ecosystem that can be a bit overwhelming to someone just getting started. Fortunately, Apache Kafka is also at the heart of an amazing community that is able and eager to help! So, if you are new, or relatively new, to Apache Kafka, welcome! I’d like to introduce you to the Kafka ecosystem, and present you with a plan for how to learn and be productive with it. I’d also like to introduce you to one of the most helpful and welcoming software communities I’ve ever encountered.
I’ll take you through the basics of Kafka—the brokers, the partitions, the topics—and then on and up into the different APIs and tools that are available to work with it. Consider it a Kafka 101, if you will. We’ll stay at a high level, but we’ll cover a lot of ground, with an emphasis on where and how you can dig in deeper.
I am still learning myself, so I will share with you what and who have helped me in my journey, and then I’ll invite you to continue that journey with me. It’s going to be a great adventure!
At Walmart, we offer a free delivery subscription service for customers. To prevent it from being misused, we run a fraud detection model on every transaction, and the model comes with very tight SLAs. So, we had to optimize for high availability and low serving latency in the Kafka streams cluster. In this talk, you’ll learn about our platform architecture and how we meet our SLA requirements. It includes discussing the tools we used for analyzing and tuning performance, and details of a custom client library that we wrote for Kafka Streams to query the precise machine on which state resides. We’ll also discuss improvements that we made to Apache Kafka, including:
• Serving reads from standbys (KIP-535)
• Serving reads while rebalancing (KIP-535)
• Reading from a store of a specific partition (KIP-562)
All of these enable us to deliver Kafka Streams applications that are focused on availability and throughput. Screen reader support enabled.
As advanced cyber threats continue to grow in frequency and sophistication, many large enterprises need to operate more effectively to protect their environments, especially as they grow in new markets. At Intel, we addressed this need by transforming from a legacy systems, to a modern, scalable Cyber Intelligence Platform (CIP) based on Splunk and Confluent Kafka. Our CIP helps our Security Operations Center identify and respond to threats faster, and supports hundreds of use cases coming from our entire Information Security organization. Today, our CIP ingests tens of terabytes of data each day and transforms this data into actionable insights with context-smart applications, streams processing, and machine learning. With Confluent Kafka serving as the core pub/sub message bus, we built a massive security data pipeline that achieves economies of scale by acquiring data once and consuming it many times. This pipeline helped us reduce our technical debt by eliminating legacy point-to-point custom connectors for our security controls and analytic solutions. At the same time, Kafka has given us the ability to operate on our data in-stream, shifting the needle of Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR) towards Near-Real-Time (NRT) in many cases. In our session, we’ll share how we architected and implemented CIP with pub/sub messaging system and streams processing, and some of the valuable lessons that we learned along the way. We’ll discuss the benefits of a highly integrated, yet loosely coupled set of security capabilities. We’ll also talk about some of our new threat detection techniques, such as:
• Filter, enrich, aggregate, join, and normalize data in-stream to deliver contextually rich and clean data, downstream to things like SIEM
• Apply logic to automate mundane tasks such as deduplication, auto branching, and filtering out false positives, extraneous, and/or bad data
• Maximize cluster availability and performance with tools like Confluent Control Center (C3), Replicator, and Multi-Region Clustering (MRC)
• Hunt for threats in-stream and NRT, with Kafka Streams and machine learning techniques Screen reader support enabled.
This session starts with the importance of Kafka as an event streaming and messaging platform for application-to-application communication - and gives a quick snapshot of the Confluent Platform. Then the "operators" method for deployment of many app platforms onto Kubernetes is underlined. We then take you step-by-step through a deployment of the Confluent Operator for Kafka on vSphere 7 with Kubernetes and show the benefits of this approach. We also show a second, external, Kafka message producer sending messages into the Kubernetes cluster and a consumer receiving them from there. This shows the ease of deployment, management and testing of Kafka with the Confluent Operator and Platform. Mention will be made about using a standalone Kubernetes cluster also. Attendees will leave with a good understanding of Kafka on modern vSphere.
To provide exceptional customer experiences at scale, the data pipelines that can move data reliably across the systems and applications in real-time should be seamlessly scalable. For the past several years, we relied on Message Queue based data pipelines to facilitate the transfer of data across the applications. However, as the number of use cases that require real-time data transfer increased rapidly, it became difficult to scale the messaging platform. Moving to Kafka helped us to resolve the data pipeline scaling issues and reduce the Publisher/Subscriber on-boarding time from several weeks to a few days. To support the on-demand scaling of Kafka clusters, we run them on RedHat OpenShift, an Enterprise Kubernetes. While managing Kafka that handles critical financial events, we have learned some lessons and developed efficient strategies to manage production-grade Kafka clusters on OpenShift. In this talk, we will present:
1. Some of the challenges that we faced with Kafka on OpenShift and how we evolved our infrastructure to overcome them.
2. Share our experiences from operating Kafka clusters at Scale in Production.
3. Our strategy for performing automated Kafka deployment and rollback in OpenShift.
4. Explain our fail-over strategy using Confluent’s Replicator to ensure service availability during cluster failures.
The data science techniques and machine learning models that provide the greatest business value and insights require data that spans enterprise silos. To integrate this data, and ensure you’re joining on the right fields, you need a comprehensive, enterprise-wide metadata repository. More importantly, you need it to be always up to date. Nightly updates are simply not good enough when customers and users expect near-real-time responsiveness. The challenge with keeping a metadata repository up to date lies not with cloud services or distributed storage frameworks, but rather with the relational database management systems (RDBMSs) that dot the enterprise landscape. At Comcast, we’ve found it relatively easy to feed our Apache Atlas metadata repo incrementally from Hadoop and AWS, using event-driven pushes to a dedicated Apache Kafka topic that Atlas listens to. Such pushes are not practical with RDBMSs, however, since the event-driven technique there is the database trigger. Triggers are so invasive and potentially detrimental to performance that your DB admin likely won’t allow one for detecting metadata changes. Triggers are out. Pulling the complete current state of metadata from a RDBMS at regular intervals and calculating the deltas is too slow and unworkable. And, it turns out that out-of-the-box log-based change data capture (CDC) is also dead-end because metadata changes are represented in transaction logs as SQL DDL strings, not as atomic insert/update/delete operations as for data. So, how do you keep your metadata repository always up to date with the current state of your RDBMS metadata? Our group solved this challenge by creating an alternate method for CDC on RDBMS metadata based on database system tables. Our query-based CDC serves as a Kafka Connect source for our Apache Atlas sink, providing event-driven, continuous updates to RDBMS metadata in our repository, but does not suffer from the usual limitations/disadvantages of vanilla query-based CDC. If you’re facing a similar challenge, join us at this session to learn more about the obstacles you’ll likely face and how you can overcome them using the method we implemented. Screen reader support enabled.
When we set out to launch our Conversations Platform at Expedia Group our goals were simple. Enable millions of travelers to have natural language conversations with an automated agent via text, Facebook, or their channel of choice. Let them book trips, make changes or cancellations, and ask questions -- “How long is my layover?” “Does my hotel have a pool?” How much will I get charged if I want to bring my golf clubs?”. Then take all that we know about that customer across all of our brands and apply machine learning models to give customers what they are looking for immediately and automatically, whether it be a straightforward answer or a complex new itinerary. And the final goal: go from zero to production in four months.
Such a platform is no place for batch jobs, back-end processing, or offline APIs. To quickly make decisions that incorporate contextual information, the platform needs data in near real-time and it needs it from a wide range of services and systems. Meeting these needs meant architecting the Conversations Platform around a central nervous system based on Confluent Cloud and Apache Kafka. Kafka made it possible to orchestrate data from loosely coupled systems, enrich data as it flows between them so that by the time it reaches its destination it is ready to be acted upon, and surface aggregated data for analytics and reporting. Confluent Cloud made it possible for us to meet our tight launch deadline with limited resources. With event streaming as a managed service, we had no costly new hires to maintain our clusters and no worries about 24x7 reliability.
When we built the platform, we did not foresee the worldwide pandemic and the profound effect it would have on the travel industry. Companies were hit with a tidal wave of customer questions, cancellations, and rebookings. Throughout this once-in-a-lifetime event, the Conversations Platform proved up to the challenge, auto-scaling as necessary and taking much of the load off of live agents.
In this session, we’ll share how we built and deployed the Conversations Platform in just four months, the lessons we learned along the way, key points to consider for anyone architecting a platform with similar requirements, and how it handled the unprecedented demands placed upon it by the pandemic. We’ll also show a demo of the platform that includes high-level insights obtained from analytics and a visualization of the low-level events that make up a conversation. Screen reader support enabled.
In this talk, veteran Kafka success expert, Mitch Henderson, will covers the most common mistakes people make when first adopting Apache Kafka. We will start with smaller single application deployments and work our way up to common mistakes made by larger centralized platform deployments. Among other things, we'll discuss: - Rules of thumb for initial infrastructure and deployment considerations - protecting cluster stability from rouge applications hell bent on destruction - Single large cluster or many small? How to decide? - Getting users onboard and doing productive work After this session, you'll be able to avoid the common mistakes, and confidently guide your team toward a better class of mistakes - those that are directly relevant to the value you add to the business.
At PayPal, our Kafka journey, which started with a handful of isolated Kafka clusters, now has us marching at a rapid pace towards a trillion messages per day in a true multi-tenant Kafka environment with thousands of brokers. To enable that tremendous growth -- which today is reflected by 30% quarter-on-quarter increases in traffic volume -- we’ve had to overcome a wide variety of challenges, including:
Deploying the tooling required to operate multiple Kafka clusters at scale Monitoring and operating a large fleet of brokers across multiple availability zones Identifying the right set of metrics to monitor and then finding a way to automate remediation Troubleshooting customer issues related to broker connectivity and data loss Planning and optimizing capacity within budget constraints Managing a polyglot client stack with a diverse set of Kafka client libraries.
And of course, doing all of the above while ensuring compliance with the stringent data security requirements mandated for FinTech companies.
In this talk, we will describe the Kafka deployment model at PayPal and how we addressed these challenges to implement event streaming at enterprise scale. Independent of the scale of your Kafka deployment – be it a large enterprise already using Kafka, a midsize company in the midst of transforming your Big Data platform, or a small but rapidly growing company just getting started with Kafka – this talk will provide you with knowledge of the ins and outs of operating Kafka at scale, learn the challenges you can expect, and understand how you can address them will help you pave the way for your own Kafka journey. Screen reader support enabled.
Kafka is among the few cool technologies that play a pivotal role in the digital journey of DBS to become one of the best bank in the world. Kafka is the foundation for our vision of application modernisation by: - making our systems event-driven - moving batch workloads to real-time - serving various financial industry use-cases that warrants proactive actions i.e Fraud detection, Anomaly and unusual behaviour detections. To realize our Digital vision and to keep up with the pace of this transformation, we decided to run Kafka "as-a-self-managed-service" at large scale. This requires minimising reliance on human operators/ engineers and have intelligent apps make all important operation related decisions. In this session, I would like to deep-dive into these intelligent apps that not only helped us maintain 100% availability but also provides self-service experience and a truly democratised Kafka ecosystem for our community. Some of these apps include:
1. Self-service on-boarding and stream discovery built by esp. leveraging Kafka admin API
2. AI Ops - using machine learning for predicting resource utilisation and early anomaly detection
3. Automated Cluster scaling, deployment and maintenance
4. Self-healing services
5. Intelligent Chaos engineering - to make apps ready for production
6. Black box Kafka monitoring
7. Load simulator for performance testing - to bring production like experience in development environments
8. Automatic tiering and streams archival - to meet several regulatory compliance and to make brokers lightweight
At Capital One we process billions of events everyday without any data loss and following exactly once processing semantics as well, some of these events are related to fraud detection, customer communications, payments processing...etc. Most of the times it is necessary for us to process these events in real time so we ReImagined auto scaling Kafka consumers to offer beyond predictable performance even during high volumes of inbound events as well as reduce resource utilization whenever possible. We are also making sure that there is no data loss and exactly once event processing when we are offering high tps. Unlike traditional applications, auto scaling an event driven real time stream consumer applications requires much more parameters and edge cases to consider. My talk will be covering all of these parameters, metrics, edge cases and steps to autoscale consumers safely..etc.
Here are highlights of my talk: Streaming architecture high level overview which is supporting processing billions of events Our journey to ReImagine Autoscaling of streams Autoscaling benefits and how to Autoscale consumers to offer beyond predictable performance Lessons learned along the way and best practices In addition to this, giving an overview on observability and monitoring streams.
You know the fundamentals of Apache Kafka. You are a Spring Boot developer and working with Apache Kafka. You have chosen Spring Kafka to integrate with Apache Kafka. You implemented your first producer, consumer and maybe some Kafka streams, it's working... Hurray! You are ready to deploy to production what can possibly go wrong? In this talk, Tim will take you on a journey beyond the basics of Spring Kafka and will share his knowledge, pitfalls and lessons learned based on real-life Kafka projects that are running in production for many years at ING Bank in the Netherlands. This deep-dive talk is a balanced combination of code examples, fully working demos and theory behind the lessons learned like: * Survive consuming a poison pill from a Kafka topic without writing gigabytes of log lines and blow up your server's hard drive? * Understand how to deal with exceptions in different cases * Validate incoming data consumed from a Kafka topic. After joining this talk, you will be inspired to bring your Spring Kafka projects to the next level and share some practical tips and tricks with your fellow developer colleagues!
This free, two-hour Fundamentals course provides an overview of what Kafka is, what it's used for, and the core concepts that enable it to power a highly scalable, available and resilient real-time event streaming platform.
This one-hour Certification Bootcamp is designed for those preparing for the Confluent Certified Developer or Confluent Certified Administrator Certification on Apache Kafka®. During this session, participants will become familiar with the certification process, and will gain invaluable tips on the examination process and how best to prepare for the exam.
In software systems, architecture that served the enterprise in the pre-cloud era is no longer sufficient. The need for cohesive customer experience is forcing the modernization of information and data supply chains. To enable next-generation application architecture, organizations are shifting from traditional data silos and batch processing to a streaming-first approach. Event streaming paradigms as seen in Kafka-based architecture are becoming de facto for such solutions. This session will dive deep into the engineering methodology, inspired by domain-driven design, to model business functions in an events-first approach and provide a practice to implement the event streaming architecture.
We’ll navigate a journey through application modernization. First, we explore business domain modeling with the swift methodology. Next, we work on a business flow that satisfies the most important functional requirements for the business stakeholder and addresses pain points surfaced from event Storming. Finally, we focus on the technical challenges of the architecture to satisfy the non-functional requirements. An example of a non-functional requirement described in this session is achieving high availability across multiple data centers while maintaining data integrity service level objectives. We accomplished this by moving business-critical workloads from legacy middleware to a Kafka-native solution using the Kafka Streams library.
Apache Kafka is getting used as an event backbone in new organizations every day. We would love to send every byte of data through the event bus. However, most of the time, connecting to simple third party applications and services becomes a headache that involves several lines of code and additional applications. As a result, connecting Kafka to services like Google Sheets, communication tools such as Slack or Telegram, or even the omnipresent Salesforce, is a challenge nobody wants to face. Wouldn’t you like to have hundreds of connectors readily available out-of-the-box to solve this problem?
Due to these challenges, communities like Apache Camel are working on how to speed up development of key areas of the modern application, like integration. The Camel Kafka Connect project, from the Apache foundation, has enabled their vastly set of connectors to interact with Kafka Connect natively. So, developers can start sending and receiving data from Kafka to and from their preferred services and applications in no time without a single line of code.
In summary, during this session we will:
• Introduce you to the Camel Kafka Connector sub-project from Apache Camel
• Go over the list of connectors available as part of the project
• Showcase a couple of examples of integrations using the connectors
• Share some guidelines on how to get started with the Camel Kafka Connectors
This session will describe and demonstrate the longstanding integration between Couchbase Server and Apache Kafka and will include descriptions of both the mechanics of the integration and practical situations when combining these products is appropriate.
The challenge with today’s “data explosion” is finding the most appropriate answer to the question, “So where do I put my data?” while avoiding the longer-term problem: data warehouses, data lakes, cloud storage, NoSQL databases, … are often the places where “big” data goes to die.
Enter Physics 101, and my corollary to Newton’s First Law of Motion:
• Data in motion tends to stay in motion until it comes rest on disk. Similarly, if data is at rest, it will remain at rest until an external “force” puts it in motion again.
• Data inevitably comes to rest at some point. Without “external forces”, data often gets lost or becomes stale where it lands. “Modern” architectures tend to involve data pipelines where downstream consumers of data make use of data generated upstream, often with built-for-purpose repositories at each stage. This session will explore how data that has come to rest can be put in motion again; how Kafka can keep it in motion longer; and how pipelined architectures might be created to make use of that data.
DataOps challenges us to build data experiences in a repeatable way. For those with Kafka, this means finding a means of deploying flows in an automated and consistent fashion.
The challenge is to make the deployment of Kafka flows consistent across different technologies and systems: the topics, the schemas, the monitoring rules, the credentials, the connectors, the stream processing apps. And ideally not coupled to a particular infrastructure stack.
In this talk we will discuss the different approaches and benefits/disadvantages to automating the deployment of Kafka flows including Git operators and Kubernetes operators. We will walk through and demo deploying a flow on AWS EKS with MSK and Kafka Connect using GitOps practices: including a stream processing application, S3 connector with credentials held in AWS Secrets Manager.
Real-time connectivity of databases and systems is critical in enterprises adopting digital transformation to support super-fast decisioning to drive applications like fraud detection, digital payments, recommendation engines. This talk will focus on the many functions that database streaming serves with Kafka, Spark and Aerospike. We will explore how to eliminate the wall between transaction processing and analytics by synthesizing streaming data with system of record data, to gain key insights in real-time.
Managing a distributed system like Apache Kafka can be extremely challenging, especially when you try to approach monitoring and managing from a single centralized GUI approach. In this talk come here and see a demo of a more decoupled approach to Kafka management and Kafka Monitoring where data is centralized but access is is distributed to scale to enterprise deployments, CICD pipelines and much much more.
Whether you are a die-hard DC comic enthusiast, mad for Marvel, or completely clueless when it comes to comic books, at the end of the day each of us would love to possess the superpower to transform data in seconds versus minutes or days. But architects and developers are challenged with designing and managing platforms that scale elastically and combine event streams with stored data, to enable more contextually rich data analytics. This made even more complex with data coming from hundreds of sources, and in hundreds of terabytes, or even petabytes, per day.
Now, with Apache Kafka and Intel hardware technology advances, organizations can turn massive volumes of disparate data into actionable insights with the ability to filter, enrich, join and process data instream. Let's consider Information Security. IT leaders need to ensure all company data and IP is secured against threats and vulnerabilities. A combination of real-time event streaming with Confluent Platform and Intel Architecture has enabled threat detection efforts that once took hours to be completed in seconds, while simultaneously reducing technical debt and data processing and storage costs.
In this session, Confluent and Intel architects will share detailed performance benchmarking results and new joint reference architecture. We’ll detail ways to remove Kafka performance bottlenecks, and improve platform resiliency and ensure high availability using Confluent Control Center and Multi-Region Clusters. And we’ll offer up tips for addressing challenges that you may be facing in your own super heroic efforts to design, deploy, and manage your organization’s data platforms.
Apache Kafka fundamentally changes how organizations build and deploy a universal data pipeline that is scalable, reliable, and durable enough to meet the needs of digital-first organizations. However, as powerful as Kafka is today, it’s not an event-streaming platform - and getting it there on your own is a long, complicated, and expensive process. Earlier this year Confluent announced Project Metamorphosis - our plan to bring the best characteristics of cloud native systems to Apache Kafka. Since May we’ve begun transforming Confluent Cloud and Confluent Platform to do just that.
Join two of our Product Managers, Dan Rosanova and Addison Huddy to: Learn how we’ve evolved Confluent Cloud with the first phase of Project Metamorphosis releases
See how Confluent Platform 6.0 brings these transformational, cloud-like qualities to self-managed Kafka Get a sneak peak of our next Metamorphosis theme and how it impacts your Kafka and event-streaming strategy.
Kafka and MemSQL are the perfect combination of speed, scale, and power to take on the world’s most complex operational analytics challenges. In this session, you will learn how Kafka and MemSQL have become the dynamic duo, and how you can use them together to achieve ingest of tens of millions of records per second and enable highly concurrent, real-time analytics. In the last few months, Kafka and MemSQL have been hard at work, devising a plan to take on the world’s next set of streaming data challenges. So stay tuned: there may just be an announcement!
As Kafka deployments grow within your organization, so do the challenges around lifecycle management. For instance, do you really know what streams exist, who is producing and consuming them? What is the effect of upstream changes? How is this information kept up to date, so it is relevant and consistent to others looking to reuse these streams? Ever wish you had a way to view and visualize graphically the relationships between schemas, topics and applications? In this talk we will show you how to do that and get more value from your Kafka Streaming infrastructure using an event portal. It’s like an API portal but specialized for event streams and publish/subscribe patterns. Join us to see how you can automatically discover event streams from your Kafka clusters, import them to a catalog and then leverage code gen capabilities to ease development of new applications.
Cloud is changing the world; Kubernetes is changing the world; real-time event streaming is changing the world. In this talk we explore some of best practices to synergistically combine the power of these paradigm shifts to achieve a much greater return on your Kafka investments. From declarative deployments, zero-downtime upgrades, elastic scaling to self-healing and automated governance, learn how you can bring the next level of speed, agility, resilience, and security to your Kafka implementations.
Kafka has fast become the center of streaming analytics applications in the modern digital enterprise. Kafka operates in the context of a broad ecosystem of data lifecycle components which need a consistent platform of security, monitoring, management and governance. This problem becomes paramount when your streaming architectures go hybrid by spanning from on-premises to the cloud. Throw in the reality of a multi-cloud setup that a lot of enterprises are facing and now, you have a complex streaming architecture that is difficult to operationally manage, monitor, secure or govern.
Cloudera remains committed to an open community driven approach and increasing the ease of use and visibility for Kafka based solutions. Attend this session to understand more about how streaming architectures can be extended easily to the hybrid cloud and multi-cloud. Also, learn about our plans for further community contributions.
The scale of Google's business has often put it in a place where data processing innovation was key to better serving our customers and users. This session will explore the evolution of data processing at Google from batch, to stream, to cloud services, and some of its impact on the broader data processing landscape.
Kafka Connect makes it possible to easily integrate data sources like MongoDB! In this session we will first explore how MongoDB enables developers to rapidly innovate through the use of the document model. We will then put the document model to life and showcase how to integrate MongoDB and Kafka through the use of the MongoDB Connector with Apache Kafka. Finally, we will explore the different ways of using the connector including the new Confluent Cloud integration.
The adoption and popularity of the microservices architecture continues to grow across a spectrum of enterprises in every industry. Although a consensus on an implementation standard has yet to be reached, advanced design patterns and lessons learned about the complexities and pitfalls of deploying microservices at scale have been established by thought leaders and the development community. With Redis and Kafka becoming de facto standards across most microservices architectures, we will discuss how their combination can be used to simplify the implementation of event-driven design patterns that will provide real-time performance, scalability, resiliency, traceability to ensure compliance, observability, reduced technology sprawl, and scale to thousands of services. In this discussion, we will decompose a real-time event-driven payment-processing microservices workflow to explore capturing telemetry data, event sourcing, CQRS, orchestrated SAGA workflows, inter-service communication, state machines, and more.