I am very excited that LinkedIn’s deployment of Apache Kafka has surpassed 1.1 trillion (yes, trillion with a “t”, and 4 commas) messages per day. This is the largest deployment of Apache Kafka in production at any company.
For those of you not familiar with Kafka, it is a horizontally scalable messaging system. It lets you take everything happening in your company and turn it into a real-time data stream that other systems can subscribe to and process. For users of this software, it acts as a replacement for traditional enterprise message brokers and a way to sync data between different systems and databases, as well as being the basis for real-time analytics and stream processing.
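To make the "stream that other systems can subscribe to" idea concrete, here is a minimal in-memory sketch. It is a toy model, not Kafka's API: the `TopicLog` class, its `publish` and `poll` methods, and the subscriber names are all hypothetical, chosen only for illustration. It shows one append-only log being read independently by two subscribers, each tracking its own position.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log (hypothetical; not the Kafka client API)."""

    def __init__(self):
        self._messages = []
        # Next offset to read, tracked separately for each subscriber.
        self._offsets = defaultdict(int)

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, subscriber, max_messages=10):
        """Return this subscriber's next unread messages and advance its offset."""
        start = self._offsets[subscriber]
        batch = self._messages[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch


log = TopicLog()
log.publish({"event": "profile_updated", "member": 42})
log.publish({"event": "connection_added", "member": 42})

# Two independent subscribers read the same stream at their own pace.
search_batch = log.poll("search-indexer")
warehouse_batch = log.poll("data-warehouse", max_messages=1)
```

Because each subscriber keeps its own offset, a slow reader never blocks a fast one, and a new system can be attached later without changing the publisher. A real deployment adds partitioning, replication, and durable storage, none of which this sketch attempts.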
It has been exciting to see Kafka scale over the years. When the LinkedIn team, which included me and my co-founders, Jay Kreps and Jun Rao, created the system in 2010, we had big dreams of seeing it turn into a central nervous system for data, but we were starting far from that. As early LinkedIn employees, we experienced firsthand the pain caused by LinkedIn’s then-legacy infrastructure. Observing the limitations of various old systems allowed us to guide the evolution towards the modern distributed infrastructure that powers the experience shared by the 380+ million LinkedIn users around the world.
Kafka plays a critical part in shaping LinkedIn’s infrastructure as well as that of the hundreds of other companies that use Kafka – from web giants like Netflix, Uber, and Pinterest to large enterprises like Cerner, Cisco, and Goldman Sachs. At these companies, Kafka powers critical data pipelines, allows data to be synced in real time across geographically distant data centers, and is a foundation for real-time stream processing and analytics.
Kafka has made it possible for companies to build better products and provide a richer and more real-time experience to their users at scale. For instance, can you imagine not receiving an important story in your LinkedIn newsfeed instantaneously? Or not receiving instant movie recommendations on Netflix?
When the team first put Kafka in production in July 2010, it was used for powering some of LinkedIn’s user activity data. By 2011, it powered 1 billion messages per day. We increased that to power not only all user activity data but all metrics and alerts for monitoring LinkedIn’s IT infrastructure, taking Kafka’s deployment size to more than 20 billion messages per day. As we progressed, we came to capture virtually everything that happened in the company using Kafka – from someone updating their profile, creating an ad campaign, or adding a connection, down to database changes. Kafka went on to become the company’s central nervous system, acting as the critical pipeline supplying all that data to all the systems and applications. This included monitoring applications, search, the graph database, the Hadoop clusters, and the data warehouse. By mid-2014, when we left LinkedIn to start Confluent, Kafka was powering more than 200 billion messages per day.
We never imagined the rapid and broad adoption of Kafka would reach these heights at LinkedIn, let alone at the many other companies that now depend on it for mission-critical business.
At Confluent, we are committed to serving the open source community and further growing the adoption of Kafka at companies worldwide. That focus is supported by metrics – the number of monthly downloads for Kafka has increased 7x in the past year alone.
We have distilled years of production experience of operating Kafka at scale into our product offering – the Confluent Platform. We also plan to continue engaging with the open source community through mailing lists, meetups, and our blog.
The developer tools in the Confluent Platform are 100% open source and provide you with what you need to put Kafka in production at scale.
In the months ahead, we will advance the platform by enhancing security and by adding multi-tenancy, streaming ETL, and stream processing capabilities.
LinkedIn’s transition to Kafka had a profound impact on the company’s ability to harness its data at scale. Data that was previously locked up in silos is now instantaneously available for processing. New high-volume data sources, like user activity data and log data, that could not be collected previously in LinkedIn’s legacy systems are now easily collected using Kafka. The same data that goes into the offline data warehouse and Hadoop is available for real-time stream processing and analytics in all applications. And all the data collected is available for storage or access in the various databases, search indexes, and other systems in the company through Kafka.
The world is changing, and these days LinkedIn isn’t the only company that needs to harness vast streams of data. GPS-enabled devices, mobile phones, the internet of things, financial data streams, and telecom are all built around large-scale stream data. At Confluent, we are helping these companies put our experience and infrastructure to use building real-time stream processing systems in all these domains.
We look forward to partnering with LinkedIn and other companies to grow the adoption of Apache Kafka as it becomes the de facto standard for real-time data transport and processing in companies globally.