Today I’m excited to announce that Neha Narkhede, Jun Rao, and I are forming a start-up called Confluent around realtime data and the open source project Apache Kafka® that we originally co-created while at LinkedIn.
Kafka is a horizontally scalable messaging system. It lets you take everything happening in your company and turn it into a realtime data stream that other systems can subscribe to and process. For users of this software it acts as a replacement for traditional enterprise message brokers, a way to sync data between different systems and databases, and a basis for realtime analytics.
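To make the model concrete, here is a minimal sketch of that publish/subscribe pattern using Kafka’s Java client. The broker address, topic name, and event payload are made-up placeholders, and the API shown is today’s client rather than anything from a particular deployment:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PubSubSketch {
    public static void main(String[] args) {
        // Publish an event describing something that happened in the business.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("profile-updates", "user-42", "updated headline"));
        }

        // Any number of downstream systems can independently subscribe to the same stream.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "analytics");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("profile-updates"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```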
I’ll come to our new company in a bit, but first let’s talk about the software. I think most software has a story, so let me tell you ours.
I joined LinkedIn about seven years ago. At the time I joined, the company was just starting to run into scaling problems in its core database, its social graph system, its search engine, and its data warehouse. Each of the systems that stored data couldn’t keep pace with the growing user base. At the time, each of our data systems ran on a single server. This left us having to continually buy bigger and bigger machines as we grew. So the task I ended up focusing on for much of my time at LinkedIn was helping to transition to distributed data systems so we could scale our infrastructure horizontally and incrementally. We built a distributed key-value store to help scale our data layer, we built a distributed search and social graph system too, and we brought in Hadoop to scale our data warehouse. And yes, all of these systems had names just as unusual as Kafka (e.g. Voldemort, Azkaban, Camus, and Samza).
This transition was both powerful and painful. It was powerful because beyond just keeping our website alive as we grew, moving to this type of horizontally distributed system allowed us to really make use of the data about the professional space that LinkedIn had collected. LinkedIn has always had amazing data on the people and companies in the professional world and how they are all connected and inter-related, but finally being able to apply arbitrary amounts of computational power to this data using frameworks like Hadoop let us build really cool products.
But adopting these systems was also painful. Each system had to be populated with data, and the value it could provide was really only possible if that data was fresh and accurate.
This was how I found myself understanding how data flows through organizations, and a little bit about what people in the data warehouse world call ETL. The current state of this kind of data flow is a scary thing. It is often built out of CSV file dumps, rsync, and duct tape, and it’s in no way ready to scale with the big distributed systems that are increasingly available for companies to adopt.
More surprisingly, data is often only collected in batch, extracted just once a day from different parts of a company and munged into a usable form. This kind of nightly processing is surprising to people outside the tech world, who have understandably grown used to a world where the same email on your phone is also instantly available on your laptop, without waiting for some kind of nightly copy job to sync them. But this state of things is so common inside the world of data warehousing and analytics that a 24 hour delay is hardly even thought about.
I ran into this problem in a roundabout way. I was managing the team responsible for adopting Hadoop and putting it to use on some of our hardest computational problems for recommendations. Our feeling about Hadoop was largely aspirational—if we could bring to bear unlimited computational power, surely we could do something useful with it. But what we quickly realized was that it isn’t just the computation that brings the value, it is the data itself.
The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than about putting together different pieces of data that were previously locked up in different silos. Or, even more often, about measuring things for the first time that previously weren’t recorded at all.
So to really succeed we realized that any system we wanted to use would only be as valuable as the data it had access to. To make our Hadoop initiative successful we needed a full, high-fidelity copy of everything happening at LinkedIn in Hadoop. So we put our heads down and worked on building out data loads for Hadoop.
After lots of hard work we evaluated how well we were doing against this goal. Our hard work had generated lots of code to maintain and was pretty hard to keep running, but had we gotten all the data? The answer was disappointing. We did an inventory and we estimated that with all our effort we had only succeeded in covering about 14% of the data in the company. And each new data source we added created a little bit more ETL work for our team to manage and operationalize. Given the pace that LinkedIn itself was evolving it wasn’t even clear that this approach would let us keep pace with new data being created, let alone get to full coverage without a small army of people.
And worse, Hadoop was just one problem. We also needed realtime processing of data in many cases. We needed to monitor the security of our site in realtime and detect spammers and scrapers immediately. We had a product that showed who had viewed your LinkedIn profile, an ad system, analytical and reporting systems, and a monitoring system, all of which needed to generate or consume data in real time. We needed to be able to integrate data into all the different systems we had.
Surely the state of the art wasn’t building out ad hoc pipelines for each of these systems? Surely the same data that went into our offline data warehouse should be available for realtime processing. And both of these should be available for storage or access in the various databases, search indexes, and other systems we ran?
At first we were convinced that this problem simply had to have been solved, and we were just ignorant of the solution. We set about evaluating messaging systems, Enterprise Service Buses, ETL tools, and any other related technology that claimed to solve the problem.
But we found each of these to be pretty flawed and quite clunky. Some only did offline data integration, relying on periodic bulk transfers rather than a continuous flow. This wouldn’t work for many of our systems, where users expect data to flow immediately behind the scenes from system to system. Some tools were realtime but couldn’t really provide the kind of guarantees required for critical data. None were built like a modern distributed system that you could safely dump all the data of a growing company into and scale along with your needs.
This is how we came to originally create Apache Kafka. We thought of it as a messaging system, but because it was built by people who had previously worked on distributed databases it was designed very differently: it came with the kind of durability, persistence, and scalability characteristics of modern data infrastructure. We worked hard to come up with an abstraction that would bind all these use cases together—low latency enough to satisfy our realtime data flows, scalable enough to handle very high volume log and event data, and fault-tolerant enough for critical data delivery.
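To give a feel for what those durability guarantees look like from the application’s side, here is a small sketch of a producer configured for critical data, using the configuration names of the modern Java client; the broker list, topic, and record are placeholders, not taken from any real deployment:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class DurableProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");           // acknowledge only after all in-sync replicas have the write
        props.put("retries", "2147483647"); // retry transient failures rather than dropping critical data

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on the returned future gives at-least-once delivery for a critical record.
            producer.send(new ProducerRecord<>("payments", "order-123", "charged $9.99")).get();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```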
But we didn’t just have in mind copying data from place to place. We also wanted to allow programs and services to tap into these data feeds and process them in real time. This kind of “stream processing” is essential for building programs that react in realtime to the data streams that are captured.
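As a toy illustration of that idea, here is a small service that subscribes to a feed and reacts to each event as it arrives; the topic, consumer group, and the crude scraper-detection rule are all invented for the example:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class PageViewMonitor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "abuse-detection");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, Integer> viewsPerUser = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // Process each event as soon as it arrives rather than waiting for a nightly batch.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    int count = viewsPerUser.merge(record.key(), 1, Integer::sum);
                    if (count > 1000) { // crude stand-in for a real scraper-detection rule
                        System.out.println("possible scraper: " + record.key());
                    }
                }
            }
        }
    }
}
```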
It quickly happened that within LinkedIn a tool that had started with a focus on ETL and loading data into Hadoop became something much more. We started collecting not just data about the business and website activity, but operational data—error logs and performance metrics. We built systems for analyzing the performance of our services and the calls they made to one another, and published all of this as a set of Kafka feeds. Kafka became the basis for realtime metrics, analytics, and alerting, as well as a variety of specialized monitoring tools for different domains such as site performance. As we progressed we came to capture virtually everything that happened in the company, from someone updating their profile, creating an ad campaign, or adding a connection, down to database changes. This ability to treat disparate types of data, from high-level business activity to low-level system details, all as realtime streams let us start to apply the same tools to all the different kinds of data we had. We could feed these streams into search systems, dump them into Hadoop and analyze performance data with SQL queries, or run realtime processing against them.
We felt that what we had ended up creating was less a piece of infrastructure and more a kind of central nervous system for the company. Everything that happened in the business would trigger a message published to a stream of activity of that type. This message acted as a kind of impulse that would let other systems react in appropriate ways—sending emails, applying security measures, storing the data for analysis and reporting, or scanning for operational problems. Each of these streams could be subscribed to by any other system, allowing programs to react in realtime to what was happening, or data to be mirrored into Hadoop, which acted as a kind of long-term memory bank.
We put a lot of effort into being able to scale this system to fulfill this vision, eventually transmitting over 300 billion messages a day across geographically distributed data centers with only milliseconds of delay. Tens of thousands of processes in each data center would tap into these streams to send out or receive messages.
What began as a tool for copying data around really ended up changing how we thought about data. We were convinced this was a big idea, so we decided to make the software open source. We released it, initially as a LinkedIn open source project, and eventually, as interest grew, donated the code to the Apache Software Foundation and continued development there. Apache has proven to be a fantastic place for the development of open source platform software and we are very proud to be a part of it.
The open source project did well and now there are thousands of companies using Kafka in similar ways. These companies include Goldman Sachs, Uber, Netflix, Pinterest, and Verizon.
But we realized that just open sourcing Kafka, the core infrastructure, wasn’t really enough to make this idea of pervasive realtime data a reality. Much of what we did at LinkedIn wasn’t just the low-level infrastructure; it was the surrounding code that made up this realtime data pipeline: code that handled the structure of data, copied data from system to system, monitored data flows and ensured the correctness and integrity of the data delivered, or derived analytics on these streams. All of this was still internal to LinkedIn, not built in a general-purpose way that other companies could adopt without significant engineering effort to recreate similar capabilities.
After working with many companies using Kafka through open source we decided we would start a company specifically devoted to commercializing this idea. We wanted to make bringing this realtime streams vision to the rest of the world our full time job.
LinkedIn, which had long supported the open source efforts, was very supportive of this idea. The people at LinkedIn were as proud of what had been done as we were and wanted the technology to succeed. They will continue to help develop the technology as well as being the heaviest and most ambitious Kafka deployment around. We found a lead investor, Eric Vishria at Benchmark, who shared our vision for an open, realtime data platform. We’re very happy to have Benchmark on board; they have been a long-time partner for successful open source companies from Red Hat and JBoss to ElasticSearch, Docker, and Hortonworks. LinkedIn and Data Collective joined Benchmark for a total of $6.9M to finance this new venture.
One thing I should emphasize is that the formation of the commercial entity doesn’t in any way signal a move away from open source. Personally I have been doing open source for a half-dozen years now and at least as many projects. It is absolutely the right way to develop software platforms, and open source companies are increasingly showing that they can succeed. Though we will produce some proprietary tools to complement our open source offering, Kafka will remain 100% open source, as will much of the rest of what we plan to do. In very practical terms the formation of a company dedicated to the effort will let us significantly increase the development effort beyond the small full time team at LinkedIn and the largely spare time contributors in the open source community.
Confluent is just getting off the ground, but, since Kafka itself is open source and widely used, we wanted to tell people what we are doing now, rather than try to keep it a secret. If you’re a company interested in exploring a realtime architecture for some of your applications, we would love to talk with you, early though we are. If you are an engineer interested in helping us build this future we’d love to talk to you too.
It has been an awesome journey for us at LinkedIn; wish us luck on this next adventure.