View sessions and slides from Kafka Summit New York City 2017
Kafka is a cornerstone of LinkedIn’s data infrastructure. It is the replication stream for Espresso and the message transport for Brooklin (our change capture system), Samza, and Venice (our derived data serving store). We describe Kafka’s fundamental roles (data storage, movement, processing, and analysis) and discuss the requirements for serving these data systems, the issues we hit, and how we addressed them.
What if the host goes down? What if the DC goes down? What if the broker is not responding? What if the client dies before an ack? What if I bounce the cluster? What if we run out of disk? In this talk we explore these and many other questions that we answered in our journey to getting comfortable running critical parts of our infrastructure on Kafka and offer our solutions along the way.
In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collection, delivery, and processing. This talk will provide a deep dive on some of the key internals that help make Kafka popular. Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput. Many companies (e.g., financial institutions) are now storing mission-critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism. One common use case of Kafka is propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.
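To make the compaction idea concrete, here is a minimal in-memory sketch, not Kafka's actual implementation. It assumes a log is a list of (key, value) records and that a value of None marks a deletion; the names `compact` and `materialize` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# A record is (key, value); a value of None marks a delete (an assumption of this sketch).
Record = Tuple[str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the most recent record for each key, preserving log order."""
    last_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [rec for offset, rec in enumerate(log) if last_offset[rec[0]] == offset]


def materialize(log: List[Record]) -> Dict[str, str]:
    """Replay a log into the table of current values per key."""
    table: Dict[str, str] = {}
    for key, value in log:
        if value is None:
            table.pop(key, None)  # delete marker removes the key
        else:
            table[key] = value
    return table


log = [
    ("user:1", "alice@old.example"),
    ("user:2", "bob@example"),
    ("user:1", "alice@new.example"),  # update supersedes the first record
    ("user:2", None),                 # delete marker for user:2
]
compacted = compact(log)
```

Replaying the compacted log yields the same table as replaying the full log, which is why a compacted topic can serve as a source for updatable records without retaining every intermediate update.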
Target is a $75 billion retailer with roughly 1,800 stores, dozens of distribution centers, and a multi-billion dollar e-commerce business. When an event happens in our environment, whether it be a price change or a sale, it is possible that thousands of distinct systems care about it. Learn how Target is using Apache Kafka to simplify data movement and how we run Kafka in a world with thousands of producers/consumers.
Security is a critical requirement for many Kafka deployments, especially in the cloud. In this talk, we will look at security features in Kafka that protect data streams and ensure the safety and integrity of data stored in Kafka topics. We will learn how to secure Kafka clusters, integrate with existing security infrastructure and understand the threats and mitigations in different environments.
Kafka’s an interesting distributed system. It’s pretty ops friendly! But running lots of clusters requires heavy automation and a careful approach to operations. This talk will cover lessons learned in running hundreds of different clusters with different workloads, war stories from doing so, and ways Kafka could be more friendly towards automation.
Venice is a new distributed system leveraging Kafka at LinkedIn. Venice pushes the envelope of Kafka by requiring dynamic topic creation and deletion, infinite retention, and cross-DC replication with identity partitioning and strict ordering guarantees. This talk will present the Kafka configs, bug fixes and new features that made these use cases possible.
Apache Kafka’s rise in popularity as a streaming platform has demanded a revisit of its traditional at-least-once message delivery semantics. In this talk, we present the recent additions to Apache Kafka to achieve exactly-once semantics. We shall discuss the newly introduced transactional APIs and use Kafka Streams as an example to show how these APIs are leveraged for streams tasks.
Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. By cleanly separating the user’s processing logic from details of the underlying execution engine, the same pipelines will run on any Apache Beam runtime environment, whether it’s on-premise or in the cloud, on open source frameworks like Apache Spark or Apache Flink, or on managed services like Google Cloud Dataflow.
We talk about how Kafka helps you to radically simplify your data architectures. We cover how you can build applications to serve your processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from scalability, elasticity, and fault-tolerance. We introduce Kafka’s Streams API, its abstractions for streams and tables, and interactive queries.
On our project, we built a system to analyze customer records in real time. We pioneered a microservices architecture using Spark and Kafka and had to tackle many technical challenges. In this session, I will show how Kafka Streams provided a great replacement for Spark Streaming and explain how to use this library to implement low-latency data pipelines.
In this talk, we’ll show how AltspaceVR developed a Kafka Streams based solution to perform real time mirroring, capture, and playback of networked avatars in a shared VR environment. For intermediate to advanced Kafka Streams users, we’ll cover some common pitfalls, lessons learned, and design patterns that helped us create a streaming application that provides features our users call magic.
Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics include queryable state, dynamic scaling, streaming SQL, very large state support, flexible deployment options, and whatever is the latest and greatest in May 2017.
Typically when we build service-based apps (microservices, SOA and the like), we use REST or some RPC framework. But building such applications becomes tricky as they get larger, more complex and share more data. We can trace this trickiness back to a dichotomy that underlies the way systems interact: data systems are designed to expose data, to make it freely accessible, but services instead focus on encapsulation, restricting the data each service exposes. These two forces inevitably compete as such systems evolve. This talk will look at a different approach, one where a distributed log holds data that is shared between services. Stateful stream processors are then embedded right in each service, providing facilities for joining and reacting to the shared streams. The result is a very different way to architect and build service-based applications, but one with some unique benefits as we scale.
Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data, and ensuring end-to-end exactly-once fault-tolerance guarantees. Since Spark 2.0 we’ve been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse or arriving in real time from Kafka. We’ll walk through a concrete example where, in fewer than 10 lines, we read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. We’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
The Marketplace data team at Uber has built a scalable complex event processing platform to solve many challenging real-time data needs for various Uber products. This platform has been in production for more than a year and supports over 100 real-time data use cases with a team of three. In this talk, we will share the details of the design and our experience, and how we employ Kafka and Samza at scale.
The stream-table duality means that we can view streams as tables and tables as streams. If we have a real database table and want a stream, we need to capture the changes to the data in the table and record them as a stream. Debezium’s open source DBMS-specific Kafka Connect connectors do exactly this, making it very easy for your software to react to your changing data in near real time.
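The duality described above can be illustrated with a small in-memory sketch. It is not Debezium's event format or API: the `CapturedTable` class, the `replay` function, and the event fields (`op`, `key`, `row`) are hypothetical names assumed for this example.

```python
from typing import Any, Dict, List, Optional


class CapturedTable:
    """A toy table that records every change as an event (table -> stream)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}
        self.changelog: List[Dict[str, Any]] = []

    def upsert(self, key: str, row: Dict[str, Any]) -> None:
        self.rows[key] = dict(row)
        self.changelog.append({"op": "upsert", "key": key, "row": dict(row)})

    def delete(self, key: str) -> None:
        if self.rows.pop(key, None) is not None:
            self.changelog.append({"op": "delete", "key": key, "row": None})


def replay(events: List[Dict[str, Any]]) -> Dict[str, Optional[Dict[str, Any]]]:
    """Rebuild the table state by applying change events in order (stream -> table)."""
    table: Dict[str, Any] = {}
    for event in events:
        if event["op"] == "delete":
            table.pop(event["key"], None)
        else:
            table[event["key"]] = event["row"]
    return table


customers = CapturedTable()
customers.upsert("c1", {"name": "Ada", "city": "London"})
customers.upsert("c2", {"name": "Grace", "city": "New York"})
customers.upsert("c1", {"name": "Ada", "city": "Paris"})  # update
customers.delete("c2")

rebuilt = replay(customers.changelog)
```

A consumer that only ever sees the change events can reconstruct the current table, and can also react to each individual change as it arrives.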
Yelp moved quickly into building out a comprehensive service oriented architecture, and before long had over 100 data-owning production services. Distributing data across an organization creates a number of issues, particularly around the cost of joining disparate data sources, dramatically increasing the complexity of bulk data applications. Straightforward solutions like bulk data APIs and sharing data snapshots have significant drawbacks. Yelp’s Data Pipeline makes it easier for these services to communicate with each other, provides a framework for real-time data processing, and facilitates high-performance bulk data applications – making large SOAs easier to work with. The Data Pipeline provides a series of guarantees that makes it easy to create universal data producers and consumers that can be mashed up into interesting real-time data flows. We’ll show how a few simple services at Yelp lay the foundation that powers everything from search to our experimentation framework.
Many companies face the challenge of gathering data from disparate sources, organizing these data into a knowledge base, and serving up this information in a relevant way to their customers. With over 19 billion digitized historical records, 80 million family trees, 10 billion people/tree nodes, and three million DNA customers, Ancestry has a trove of data that helps people gain a greater sense of who they are, where they came from, and who else shares their family’s journey. Companies with big data challenges often integrate Kafka into their data architecture; yet despite successful Kafka integration, dealing with massive quantities of data can still be overwhelming and debilitating. How can you make sense of all the data from various sources in your warehouse, meet the needs of a growing business, and remain competitive? One of the ways Ancestry tackled this challenge was by introducing the schema registry into the heart of the data fabric and development process. The results have been transformational and dramatically reduced the time it takes from data source definition to reporting and production. Pre-schema registry, new data sets could take months to implement and then get lost in the petabytes of data. Join this session and learn how Ancestry brought new data sets into production in not months or weeks, but days, and how the schema registry transformed our data fabric.
Bank of New York Mellon (BNYM) created a scalable data distribution hub to analyze hundreds of millions to billions of transactions on a daily basis. The data distribution hub, based on Apache Kafka, connects multiple data source feeds to HPE Vertica for analytics, in turn feeding output workflows. Learn how Bank of New York Mellon processes data feeds, transforms data and validates its quality, and leverages Apache Kafka to distribute and buffer messages, accommodating data spikes so that the overall system manages its resources efficiently and analyzes data workflow requests in a timely manner.
Kafka plays a central role in the data ecosystem at Airbnb. We operate multiple clusters powering use cases such as analytics, change data capture and inter-service communication. In this talk we present the design and usage of our streaming data pipeline for product analytics, including the services, tools and workflows we developed to meet strict quality and latency guarantees.
As one of the major news organizations in the world, The New York Times published its first issue on September 18, 1851. In the ensuing 164 years, we have published approximately 16 million articles, and for the past 20 years we have also published online. We are now building a new publication pipeline around Kafka. This pipeline functions as the source of truth for all published content, and decouples producers from consumers of content. The pipeline is implemented as an immutable log: all content is published to the log, and all the back-end systems driving the different online experiences access content by consuming this log. I want to explain how this publishing pipeline works, how it interacts with other systems, and what our experiences have been so far.
Kafka Connect recently introduced a feature called “single message transforms” for lightweight transformation inline with import and export to/from Kafka. Do they solve a real problem? Yes. Are they going to solve your ETL woes? No. In this talk I’ll explain how you *can* solve these problems, using Kafka, Kafka Connect, and Kafka Streams.
The future of streaming data pipelines is in the cloud, combining the agility of microservice architecture with the elasticity, reliability and scalability of cloud platforms. We’ll show you how to build Kafka and KStream microservices using Spring Cloud Stream and how to orchestrate and deploy them seamlessly to cloud platforms like Cloud Foundry, Kubernetes or Mesos using Spring Cloud Data Flow.