
Disaster Recovery in 60 Seconds: A POC for Seamless Client Failover on Confluent Cloud

Written by Sylvain

I’ve worked with Apache Kafka® since 2019, and deciding how to design and implement client failover was a sticking point in almost every use case I dealt with. Even for Confluent customers—who have the benefit of features such as Confluent Replicator, Multi-Region Clusters, and Cluster Linking—ensuring seamless failover between Kafka environments is a challenging problem.

So a couple of months ago, I decided to explore how to build a disaster recovery failover orchestration using Confluent Cloud Gateway to deliver seamless client failover with Confluent. This post covers the assumptions I made while developing a proof of concept (POC) for an approach that enables Kafka disaster recovery within 60 seconds.

Dealing With an Apache Kafka® Cluster or a Cloud Service Provider Outage

When considering how to handle a disaster, two key requirements come into play: the recovery point objective (RPO), which defines how much data loss is acceptable, and the recovery time objective (RTO), which defines how quickly the service must be restored.

In practice, the most common set of requirements we observe is a low—but not zero—RPO, combined with a low RTO. While there are now many tools and patterns available to achieve low RPO and limit data loss, the challenge today is more often on the RTO side.

Neither Kafka client applications (e.g., producers and consumers) nor their server components (e.g., Kafka brokers, ZooKeeper or KRaft, Kafka Connect, Kafka Streams) are designed to be natively resilient to a full failure. Thanks to tools such as Cluster Linking, multi-cluster deployments have become the norm, but Kafka clusters have no built-in awareness of data replicated to another cluster.

So you have to tackle both sides: server and client. With that in mind, let’s start by assuming that your Confluent Cloud architecture uses an active-passive, multi-region disaster recovery setup. This is the most common approach we see customers adopt, as it helps satisfy a strict RPO while minimizing data loss.

Digging into RPO/RTO With a Server-Side Outage

Unfortunately, you can't get an RPO of zero without synchronous replication, and in the Kafka world, only Confluent Platform Multi-Region Clusters offer it. So achieving an RPO of zero with an active-passive cluster is not possible, but you can get close to zero data loss with the right approach.

Multiple asynchronous tools are available: 

  • Confluent Cloud Cluster Linking: This is a fully managed service in Confluent Cloud used to replicate data from a source cluster to a destination cluster.

  • Confluent Replicator: This proprietary solution is a self-managed connector based on Kafka Connect, which is available via Confluent Marketplace.

  • Kafka MirrorMaker 2: Part of the open source Kafka project, this self-managed component is based on Kafka Connect and copies messages from one cluster to another in real time.

Confluent Cloud Cluster Linking offers more advantages than the other solutions, including offset preservation, access control list (ACL) syncing, and topic partition replication. As a fully managed service, it also eliminates the operational burden of self-managed replication. That makes it the right choice for server-side disaster recovery and failover in Confluent Cloud: low-latency replication that preserves critical metadata end to end, without the operational complexity of self-managed tooling.

Data is replicated from active clusters to passive clusters across cloud regions using Cluster Linking in a Confluent Cloud environment.

Even though Cluster Linking handles data replication, a set of manual operations is still required to promote the topics on the disaster recovery cluster to read/write mode.

Your RTO will largely depend on your ability and responsiveness to carry out these operations, whether through automated scripts, manual API calls, or other procedures.
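
As one illustration of what such automation could look like, here is a minimal sketch in Java that calls the Cluster Linking mirror actions of the Confluent Cloud Kafka REST API (v3). It assumes a "mirrors:failover" action path and a JSON body listing mirror topic names; the REST endpoint, cluster ID, link name, topic name, and API key are placeholders, and the exact request shape should be checked against the current API reference.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class PromoteMirrorTopics {
    public static void main(String[] args) throws Exception {
        // Placeholders: the DR cluster's REST endpoint, cluster ID, link name, and API key/secret
        String restEndpoint = "https://pkc-xxxxx.region.provider.confluent.cloud:443";
        String clusterId = "lkc-passive";
        String linkName = "dr-link";
        String credentials = Base64.getEncoder()
                .encodeToString("API_KEY:API_SECRET".getBytes());

        // Assumption: the Kafka REST API v3 exposes a "mirrors:failover" action on cluster links
        // (a "mirrors:promote" action would be used for a planned switchover where both clusters are reachable)
        String url = restEndpoint + "/kafka/v3/clusters/" + clusterId
                + "/links/" + linkName + "/mirrors:failover";
        String body = "{\"mirror_topic_names\": [\"my-topic\"]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "Basic " + credentials)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}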

Digging into RPO/RTO With a Client-Side Outage

On the client side, everything becomes more complex. Every Kafka client requires a bootstrap.servers configuration, the list of host/port pairs used to establish the initial connection to the Kafka cluster. Kafka producers and consumers aren’t designed to change these cluster endpoints without restarting the application itself.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
// The client is pinned to a single cluster endpoint at startup.
props.put("bootstrap.servers", "active-cluster.confluent.cloud:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// other props (security, acks, etc.)

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
String topic = "my-topic", key = "key", val = "value";
final ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, val);
producer.send(record);

So a given client can talk to one specific Kafka cluster; if that cluster fails, the client is stuck until the cluster comes back or its requests fail with a timeout exception.

How Kafka Clients Behave Before and After Cluster Endpoint Failure

In terms of RPO, the Kafka producer has a default delivery timeout of two minutes; after that time passes, records are considered expired. So if the switchover operation takes longer than two minutes, you may risk losing some messages.
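
As a minimal illustration of the knobs involved, the delivery timeout, together with the buffering settings discussed next, can be tuned to give a failover more headroom. The values below are assumptions for illustration, not recommendations; size them against your own RPO, traffic, and memory budget.

Properties props = new Properties();
props.put("bootstrap.servers", "active-cluster.confluent.cloud:9092");
// Time a record may spend in the producer (batching + retries) before it expires.
// The default is 120000 ms (2 minutes); raising it gives a failover more time to complete.
props.put("delivery.timeout.ms", "300000");
// Memory available for buffering unsent records while the cluster is unreachable (default 32 MB).
props.put("buffer.memory", "67108864"); // 64 MB
// Bound each individual request so retries kick in quickly once the endpoint switches.
props.put("request.timeout.ms", "15000");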

During the outage, Kafka producers buffer messages internally until the connection to the cluster is restored. (Check out key parameters such as delivery.timeout.ms and buffer.memory.) As soon as the cluster is running again, Kafka producers flush these buffered messages without data loss. Keep in mind that these messages live only in memory: if you have to restart your producer to change its bootstrap servers, you may lose some of them.

While Cluster Linking’s high-fidelity, zero-ops replication capabilities solve failover challenges on the server side, seamless Kafka client and server-side failover is still hard to achieve because of the client side of the equation. That’s why I started to explore how to combine replication with Confluent Cloud Gateway to ensure resilience even when the active cluster is down.

How a Kafka Gateway Can Make Client and Server Failover More Seamless

When the active Kafka cluster in an active-passive architecture fails, clients pointing to the failed cluster have no idea that an exact copy of the cluster they need exists elsewhere. To address this gap without restarting your clients, you must deploy a self-managed component, such as a TCP proxy with custom DNS, that sits between the application and the brokers and serves as a gateway able to intercept requests and redirect them to the correct Kafka cluster.

Confluent Cloud Gateway serves as the primary traffic control layer, automatically redirecting all Kafka client traffic to the passive cluster when the active cluster becomes unavailable. This failover applies whether the outage affects a single cluster or an entire cloud service provider (CSP) region, ensuring continuity without requiring client-side changes.

Using Confluent Cloud Gateway to intercept requests between Kafka Clients and active clusters to connect to passive clusters during an outage.

The gateway needs to be very resilient to broker, cluster, or CSP region outages. When your active cluster fails, this setup still requires manual action: You must trigger a failover in the gateway by closing the in-flight connections and switching to passive mode. That will prompt Kafka clients that use the active cluster as an endpoint to re-bootstrap the connection and resume their jobs, but now they’ll produce messages to or consume messages from the passive cluster.

But be careful: it's trickier than it seems. Mirror topics created with Cluster Linking are read-only by default to preserve partition offsets. With a read-only passive cluster, when the gateway switches to passive mode, the producer won’t be able to write messages to its topics and will get the following error: “Cannot append records to read-only mirror topic XXXX.”

To avoid losing messages, you need an orchestrator that wraps the failover and manages Cluster Linking in the same envelope. The workflow could be (a minimal code sketch follows the list):

  1. Dropping in-flight connections

  2. Reversing the Cluster Linking flow, i.e., starting to mirror from the new active cluster to the old one

  3. Switching the gateway to the passive cluster

  4. Accepting new connections from Kafka clients
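
As a rough illustration of that workflow, here is a minimal sketch in Java. GatewayControl and LinkControl are hypothetical interfaces standing in for the gateway’s admin operations and the Cluster Linking management calls (REST API or CLI); they are not part of any Confluent library, and a real implementation would need error handling and retries.

// Hypothetical control interfaces for the gateway and for Cluster Linking management.
interface GatewayControl {
    void dropInFlightConnections();
    void switchTo(String clusterId);
    void acceptNewConnections();
}

interface LinkControl {
    void failoverMirrorTopics(String clusterId, String linkName); // make mirror topics read/write
    void createReverseLink(String newActiveId, String oldActiveId);
}

class FailoverOrchestrator {
    private final GatewayControl gateway;
    private final LinkControl links;

    FailoverOrchestrator(GatewayControl gateway, LinkControl links) {
        this.gateway = gateway;
        this.links = links;
    }

    void failover(String oldActiveId, String passiveId, String linkName) {
        gateway.dropInFlightConnections();                // 1. drop in-flight connections
        links.failoverMirrorTopics(passiveId, linkName);  // 2. make mirrors writable...
        links.createReverseLink(passiveId, oldActiveId);  //    ...and reverse the replication flow
        gateway.switchTo(passiveId);                      // 3. point the gateway at the passive cluster
        gateway.acceptNewConnections();                   // 4. let clients re-bootstrap and resume
    }
}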

A Kafka Gateway for Seamless Failover – From Conception to Demo

An aside

This demo is a POC and is not production-ready. Use at your own risk.

Confluent recently released Confluent Cloud Gateway, a self-managed Kafka protocol proxy that can address multiple use cases, such as:

  • Migration of on-premises clients to Confluent Cloud without client changes 

  • Disaster recovery switchover from an unhealthy cluster to a healthy cluster without client changes, achieving a significant reduction in recovery time 

  • Secure external partner access for a private cluster

Over the last year, I’ve been tinkering with the idea of building a gateway that wraps the client failover and manages the replication mechanism to enable more seamless failover in Confluent Cloud environments. That’s why I built this Kafka Gateway demo in my spare time.

This is basically a service where you can spin up a gateway and trigger a failover (and a failback) when your Confluent Cloud clusters fail—without needing to restart your Kafka clients. Additionally, it can manage the replication process with Cluster Linking.

Architecture of the Kafka Gateway demo: Wrapping client failover and the Cluster Linking replication mechanism in the same envelope.

Scope of This Demo

The scope of this demo includes: 

  • Identity Passthrough to Confluent Cloud. Client credentials are forwarded directly to Kafka clusters without modification. There is no authentication swap, so your credentials setup is exactly the same as when you connect to Confluent Cloud. 

  • OAuth2 Support. Failover is supported only for Kafka clients that use OAuth2 as the authentication mechanism. Because the classic API key/secret is scoped per Kafka cluster in Confluent Cloud, you can’t trigger a failover without changing your credentials. OAuth2 and identity pools in Confluent Cloud are scoped per organization, so you can keep the same client ID/secret for accessing multiple Kafka clusters (see the example configuration after this list). 

  • Data Replication. Cluster Linking is managed directly when failover is triggered to avoid data loss and minimize your RTO.

  • Single-Region Support Only. Multi-region is planned for a future iteration.
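
For context, a Confluent Cloud OAuth2 client configuration might look roughly like the sketch below. All values are placeholders; depending on your Kafka client version, the login callback handler may live under the org.apache.kafka.common.security.oauthbearer.secured package, and Confluent Cloud may additionally expect an extension_logicalCluster JAAS extension, whose handling during a gateway failover is outside this sketch.

// All values are placeholders; the point is that clientId/clientSecret and the identity
// pool are organization-scoped, so the same credentials work against either cluster.
Properties props = new Properties();
props.put("bootstrap.servers", "my-gateway.example.com:9092"); // clients point at the gateway
props.put("security.protocol", "SASL_SSL");
props.put("sasl.mechanism", "OAUTHBEARER");
props.put("sasl.oauthbearer.token.endpoint.url", "https://my-idp.example.com/oauth2/token");
props.put("sasl.login.callback.handler.class",
        "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginCallbackHandler");
props.put("sasl.jaas.config",
        "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required"
        + " clientId='<client-id>' clientSecret='<client-secret>'"
        + " extension_identityPoolId='<pool-id>';");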

Failover and Failback in One Click

The main idea of this POC was to be able to trigger a failover and a failback with just one click and the lowest RTO possible.

A failover can be triggered directly from the portal user interface (UI). Two options are available. 

  • Unplanned: When your main cluster faces a real outage (meaning it’s no longer available), you can’t stop the cluster link cleanly. So mirror topics are promoted to read/write mode, and the gateway is re-bootstrapped with the passive cluster. 

  • Planned: Both clusters are available, so the connection is closed gracefully.

Failing back can be triggered when your old active cluster is up and ready again and you’d like to restore the normal flow. Two options are available again. 

  • Recover first: The gateway will first truncate and restore and will then reverse the replication flow. (However, Cluster Linking does not support transactions on mirror topics. Be careful during the failback process; this can lead to side effects.)

  • Do not recover: Do the reverse failover immediately without recovering data. Some data may be lost.

Try it out.

Limitations to Keep in Mind

Remember, this is a demo and is not production-ready. You shouldn’t use this platform for a production use case. Some limitations are still present today, such as:

  • The demo supports only Confluent Cloud Dedicated clusters with public endpoints.

  • The gateway must be deployed in a single cloud region.

  • The gateway is scoped only to Confluent Cloud Kafka clusters. Schema Registry is not part of the POC yet.

Ready to try it out? Get started with Confluent Cloud.

Next Steps – Start Implementing Seamless Kafka Connectivity

Because the gateway becomes a critical component of your event-driven systems, enterprise support will be essential for what’s coming next.

As we already mentioned, Confluent has introduced Confluent Cloud Gateway, and this self-managed, cloud-native Kafka proxy is helping redefine how organizations connect, secure, and manage their Kafka environments at scale.

With Confluent Cloud Gateway, teams can say goodbye to complex broker lists, inconsistent security settings, and the operational pain of direct client-to-cluster connections. Instead, Confluent Cloud Gateway provides a stable, intelligent, and protocol-aware entry point that makes client connectivity effortless and reliable.

One of its standout capabilities is automatic disaster recovery switchover, which seamlessly reroutes clients from an unhealthy cluster to a healthy one without any configuration changes. The result? Dramatically reduced recovery times and improved business continuity.

For those looking for a fully managed experience, Confluent Cloud is already working on an integrated client switchover feature that will bring the same resilience and simplicity directly to your Confluent Cloud clusters without any gateway components to manage.

Stay tuned. The future of truly seamless Kafka connectivity is closer than ever.


Apache®, Apache Kafka®, Kafka®, Apache Flink®, Flink®, and the Kafka logos are registered trademarks of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of these marks.

  • Sylvain works as a Senior Customer Success Technical Architect at Confluent, advising enterprise clients on architecture design, security, and best practices to help them implement scalable and reliable data-streaming platforms. He coordinates across product, engineering, sales, and support teams to ensure successful deployment and long-term value from Confluent’s solutions. He guides customers through technical life cycles—from planning and migration to performance tuning and operational maturity—and helps them adopt industry-standard approaches for real-time data streaming.
