This is the third month of Project Metamorphosis, where we discuss new features in Confluent’s offerings that bring together event streams and the best characteristics of modern cloud data systems. In the first two months, we talked about elasticity and cost. For this month, I’d like to talk about infinite storage for Apache Kafka®. Confluent Cloud CKUs today have storage limits, but not for long. Rolling out soon, all Standard and Dedicated Clusters will have no storage limits.
As enterprises become more digital and transition more towards real time, many companies realize the importance of having a centralized event streaming platform to handle the billions of interactions their customers have across their applications and services. Apache Kafka has become the standard for event streaming, but the typical setup is to store data in Kafka for days or weeks.
Kafka platform operators have needed to balance offering their developers longer data retention periods and their infrastructure costs—the longer the retention period the more hardware that is required. This is no longer a trade-off Kafka platform operators need to make in Confluent Cloud. Operators just pay for the data that is sent to Confluent Cloud and the system just scales. No pre-provisioning of storage, no waiting to scale, Confluent Cloud’s storage layer is truly infinite. Best yet, consumers that need more time to catch up (longer data retention times) don’t affect the performance of real-time clients.
The way Confluent Cloud storage now scales is a game-changer when it comes to operating a Kafka cluster, but why infinite?
I have talked to many Confluent customers over the years. Quite a few of them, after using Kafka for some time, started asking the question whether Kafka can be their system of record.
Within their enterprises, Kafka is often the very first point where all types of digitized data are integrated together. This integration effort is expensive, and you really just want to do it once. Therefore, Kafka is becoming the single source of truth for all other systems (data warehouse, search engine, monitoring systems, etc.) and applications that require this integrated digitized data. Those systems and applications typically obtain data from Kafka incrementally in real time. However, historical data, when combined with current information, can further increase the accuracy of insights and overall quality of customer experience. So, occasionally, those systems and applications may also need to access historical data.
The common practice today is to maintain historical data in a separate system and to direct the application to that system when historical data is needed. This adds complexity in that every application has to deal with an additional data source other than Kafka. Developers have to use two sets of APIs, understand the performance characteristics of two different systems, reason about data synchronization when switching from one source to the other, etc.
Imagine if the data in Kafka could be retained for months, years, or infinitely. The above problem can be solved in a much simpler way. All applications just need to get data from one system—Kafka—for both recent and historical data. What might this look like in practice?
Tiered Storage, which is in preview in Confluent Platform, was built upon innovations in Confluent Cloud, which is combined with various other performance optimization features to deliver on three non-functional requirements: cost effectiveness, ease of use, and performance isolation.
If your organization has been using or is considering using Kafka, it would be useful to think through how infinite retention can help you. For example, could your existing architecture be simplified with Kafka being the system of record? Could new applications leverage both real time and historical data to provide a better user experience?
As I mentioned earlier, we’re enabling infinite retention in Confluent Cloud on AWS clusters in July, with the other cloud providers coming after. To get started, you can use the promo code CL60BLOG for $60 of additional free Confluent Cloud usage.* No pre-provisioning is required, and you pay for only the storage being used. Stay tuned for future announcements and watch the demo, which shows how you can scale from zero to 1 PB of data effortlessly with Confluent Cloud.
Jun Rao is the co-founder of Confluent, a company that provides an event streaming platform on top of Apache Kafka. Before Confluent, Jun Rao was a senior staff engineer at LinkedIn where he led the development of Kafka. Before LinkedIn, Jun Rao was a researcher at IBM’s Almaden research datacenter, where he conducted research on database and distributed systems. Jun Rao is the PMC chair of Apache Kafka and a committer of Apache Cassandra.