Building distributed systems is a huge undertaking, but the complexity doesn’t end once your application or platform is “production ready.” Keeping these systems online and operational through cloud region outages, a network partition, or just scheduled maintenance is a constant challenge.
The bottom line: you don’t want data pipelines for essential business services, customer-facing products, or enterprise data platforms to go dark. That’s where cross-data-center replication in Apache Kafka® becomes essential.
In this post, we’ll walk you through how to set up Kafka for cross-data-center replication using Kafka MirrorMaker, explore design considerations, and share practical insights on what actually works in real-world deployments.
Did you know? Cluster Linking in Confluent Cloud makes cross-data-center replication faster, more cost-effective, and available across public clouds. Try it for free and see for yourself.
Kafka has long been the backbone for real-time, high-throughput data streaming. But in production, teams need more than just performance; they need geo-redundancy, high availability (HA), and disaster recovery (DR) that spans across cloud regions or physical data centers.
In today's interconnected systems, ensuring uninterrupted data flow between environments is a foundational requirement. Whether you're syncing services across continents or preparing for failover during outages, replicating Kafka topics across clusters ensures that your applications can keep running even when parts of your infrastructure go down. For a deeper understanding of geo-redundant Kafka architectures and replication strategies, check out “Multi-Geo Replication 101 for Apache Kafka: The What, How, and Why.”
From disaster recovery and seamless data migration to compliance-driven multi-region storage, replication ensures:
Business continuity
Data durability
Minimal downtime
Better latency for region-specific consumers
Looking for a step-by-step tutorial to get hands-on experience? Get started with Apache Kafka 101 or dive deeper with the Hybrid and Multicloud Architecture course on Confluent Developer.
Before jumping into tools, it helps to think about the “shape” of your replication setup. There are three big decisions:
Active-Passive: One cluster does the work, another waits in the wings. Simpler to manage, cheaper, but you’ll need a failover step if the primary fails.
Active-Active: Both clusters actively handle traffic. This means near-zero downtime, but it’s more complex and costs more to run.
Across Availability Zones (AZs): Great for high availability within a region, with low latency. Protects you from single AZ outages.
Across Regions: Adds disaster resilience and helps meet compliance rules, but comes with higher latency and transfer costs.
Recovery Time Objective (RTO) = How quickly you can be back online after a failure.
Recovery Point Objective (RPO) = How much data you can afford to lose.
Lower RTO/RPO means faster recovery and less data loss but also higher complexity and cost.
In short, figure out your tolerance for downtime and data loss, then pick the pattern and scope that fit your needs (and budget).
Once you know what you need, you need a way to do it, and that's where Kafka MirrorMaker comes in. Kafka MirrorMaker is the go-to tool for replicating topics from one Kafka cluster (source) to another (target). It's ideal for:
Cross-region data synchronization
Multi-cloud or hybrid deployments
Backup and disaster recovery use cases
Global data distribution
By streaming messages in near real-time from a source cluster to a destination cluster, MirrorMaker helps ensure:
Consistent data availability across clusters
Fault tolerance across infrastructures
Higher system resilience during outages or failures
If your systems operate at scale or span geographies, mastering MirrorMaker is a must. It empowers you to design multi-cluster architectures that are robust, flexible, and production-grade.
Using MirrorMaker 2 to Design Resilient Multi-Cluster Architectures
Below is a high-level overview of how to configure Kafka MirrorMaker for reliable data replication. These steps apply broadly to both MirrorMaker 1 (MM1) and MirrorMaker 2 (MM2), though the connector configuration and launch command shown are MM2-specific.
Install Kafka on both clusters. Download and install Apache Kafka on both your source and target clusters. If you're using an older version of Kafka, make sure ZooKeeper is properly configured.
Configure source and target clusters. Define the Kafka broker addresses for both clusters. Confirm network connectivity between the source and target clusters.
Create configuration files. Prepare consumer.properties to consume data from the source cluster and producer.properties to produce data to the target cluster.
Set up MirrorMaker 2 connectors. Use connect-mirror-maker.properties to configure MirrorMaker 2. Specify the source and target clusters. Define replication policies using replication.policy.class and select the topics to replicate.
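For reference, a minimal connect-mirror-maker.properties might look like the sketch below. The cluster aliases (primary, backup), broker addresses, and topic patterns are placeholders for your own environment:

```properties
# Cluster aliases (hypothetical names; substitute your own)
clusters = primary, backup

# Bootstrap servers for each alias
primary.bootstrap.servers = primary-broker1:9092,primary-broker2:9092
backup.bootstrap.servers = backup-broker1:9092,backup-broker2:9092

# Enable one-way replication from primary to backup
primary->backup.enabled = true

# Topics to replicate (comma-separated names or regex patterns)
primary->backup.topics = orders.*,payments.*

# Replication factor for topics MirrorMaker 2 creates on the target
replication.factor = 3

# The default policy prefixes mirrored topics with the source alias
# (e.g., orders becomes primary.orders on the target)
replication.policy.class = org.apache.kafka.connect.mirror.DefaultReplicationPolicy
```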
Launch MirrorMaker. Start the MirrorMaker 2 process using the Kafka Connect framework: bin/connect-mirror-maker.sh config/connect-mirror-maker.properties
Monitor the logs for any errors or warnings to ensure the replication is running smoothly.
Verify data replication. Use Kafka consumer CLI commands to validate that messages are being mirrored. Check topic offsets and consumer lag to ensure consistency.
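For example, assuming the default replication policy and a source alias of primary, the source topic orders shows up on the target as primary.orders, and you can spot-check it with the standard CLI tools (broker addresses and the consumer group name here are hypothetical):

```bash
# Read a few records from the mirrored topic on the target cluster
bin/kafka-console-consumer.sh --bootstrap-server backup-broker1:9092 \
  --topic primary.orders --from-beginning --max-messages 10

# Inspect offsets and lag for a consumer group on the target cluster
bin/kafka-consumer-groups.sh --bootstrap-server backup-broker1:9092 \
  --describe --group my-consumer-group
```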
Optimize and monitor replication. Monitor performance using tools like JMX, Prometheus, or the Kafka UI. Tune parameters such as fetch.min.bytes, compression.type, and batch.size for better performance.
Kafka MirrorMaker 2 (MirrorMaker 2) builds on the original MirrorMaker with a more powerful and flexible architecture, thanks to its integration with Kafka Connect. Whether you're syncing clusters across regions or planning for disaster recovery, MirrorMaker 2 offers a robust toolkit for modern data replication.
Here are five standout features that make MirrorMaker 2 a valuable component in any distributed Kafka setup:
Feature | Capabilities | Benefits
---|---|---
Efficient Data Replication Across Clusters | Ensures consistent data availability in both source and target environments; supports use cases like redundancy, disaster recovery, and hybrid/multicloud architectures; built for streaming at scale, handling high-throughput replication with ease | With MirrorMaker 2, your data pipelines stay synchronized even when they span physical data centers or cloud platforms.
Multi-Cluster Replication Across Regions and Environments | Replicates topics between clusters in different cloud regions, data centers, or availability zones; powers global event streaming, ensures low-latency access, and supports compliance with data residency regulations; simplifies setting up Kafka in multi-region architectures | This makes MirrorMaker 2 a strong choice for companies operating at a global scale.
Offset Syncing for Consumer Groups | Keeps track of consumer progress across clusters; enables seamless failover, so consumers in the target cluster can pick up right where they left off in the source; useful for application migration, blue/green deployments, and DR testing | No more reprocessing old data or losing your place; MirrorMaker 2 handles it for you.
Selective Topic Replication | Specify patterns or exact topic names in the configuration (e.g., topics=metrics-*,user-events); reduces bandwidth usage and storage costs; gives you fine-grained control over what gets synchronized between clusters | Unlike MirrorMaker 1, which often required full-cluster replication, MirrorMaker 2 lets you choose exactly which topics to replicate, which helps when only certain parts of your Kafka workload require replication.
Automatic Recovery From Failures | Automatically detects and recovers from cluster outages or network disruptions; resumes replication once the connection is restored; keeps replicated topics in sync with minimal manual intervention | MirrorMaker 2 is built to handle instability gracefully; this built-in resilience ensures high availability, fault tolerance, and reduced risk of data loss, all without needing complex external orchestration.
Whether you're aiming for global data availability, seamless application migration, or disaster-proof pipelines, MirrorMaker 2 delivers with power, flexibility, and automation.
To get the most out of your replication strategy, it’s crucial to follow best practices that minimize lag, avoid disruptions, and keep your data pipelines healthy.
In this section, we’ll explore real-world tips, common pitfalls, and troubleshooting techniques to help you confidently operate Kafka MirrorMaker in production environments.
MirrorMaker 2 relies on both Kafka Connect and a web of configuration files for consumers, producers, and clusters. Even a small misconfiguration can cause large-scale issues like data loss or complete replication failure.
Common symptom: High error rates, missing data in the target cluster.
Solutions:
Validate your consumer.properties and producer.properties files. Confirm that ACLs and authentication mechanisms (SASL, TLS, etc.) are correctly set up; a sketch of typical client security settings follows this list.
Test cluster connectivity before deployment.
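As a reference point, a typical SASL/TLS client section in those files looks roughly like the following; the mechanism, file paths, and credentials are placeholders and must mirror what your brokers actually expect:

```properties
security.protocol = SASL_SSL
sasl.mechanism = PLAIN
sasl.jaas.config = org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="replicator" \
  password="<secret>";
ssl.truststore.location = /etc/kafka/secrets/truststore.jks
ssl.truststore.password = <truststore-password>
```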
Replication lag is a silent killer in real-time systems. Left unchecked, it can break SLAs, delay insights, and create confusion between source and mirrored data.
Common symptom: Delayed or inconsistent data across clusters
Solutions:
fetch.min.bytes and fetch.max.wait.ms: Adjust these to reduce fetch latency.
batch.size and linger.ms: Tweak for better throughput without increasing lag.
Scale horizontally by increasing the number of partitions and MirrorMaker workers to distribute the load more evenly; a tuning sketch follows this list.
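In connect-mirror-maker.properties, these client-level settings can be scoped per cluster alias using the {alias}.consumer.* and {alias}.producer.* prefixes. The values below are illustrative starting points, not recommendations; assume primary and backup are your source and target aliases:

```properties
# Let the source-side consumer accumulate larger fetches
# (trades a little latency for throughput)
primary.consumer.fetch.min.bytes = 65536
primary.consumer.fetch.max.wait.ms = 500

# Batch and compress writes on the target side
backup.producer.batch.size = 262144
backup.producer.linger.ms = 10
backup.producer.compression.type = lz4

# Run more replication tasks to spread load across partitions
tasks.max = 8
```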
MirrorMaker is JVM-based and memory-intensive, especially during high-throughput replication. Default memory settings may not be sufficient for large workloads.
Common symptom: Frequent restarts, out-of-memory errors, or JVM crashes
Solutions:
Increase the heap memory allocation for MirrorMaker, as shown in the sketch below.
Use message compression to reduce memory pressure.
Monitor batch.size and overall JVM memory usage.
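Kafka's launcher scripts honor the KAFKA_HEAP_OPTS environment variable, so a simple way to raise the heap is shown below; the sizes are illustrative and should be tuned to your workload:

```bash
# Give the MirrorMaker 2 JVM a larger heap before launching it
export KAFKA_HEAP_OPTS="-Xms4g -Xmx8g"
bin/connect-mirror-maker.sh config/connect-mirror-maker.properties
```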
If MirrorMaker can’t reach your Kafka brokers, nothing else matters. Network failures and misconfigured endpoints are common causes of replication breakdowns.
Common symptom: MirrorMaker can't start or continuously fails to connect
Solutions:
Verify bootstrap.servers in all configuration files.
Ensure both clusters are reachable via network and not blocked by firewalls, VPNs, or security groups.
Confirm that security settings (SSL certs, SASL credentials) are valid and up to date.
Kafka won't replicate topics that don't exist on the target cluster (unless topic auto-creation is enabled). This leads to data silently not being mirrored.
Common symptom: Data missing in target cluster
Solutions: Either enable topic auto-creation on the destination cluster or pre-create the required topics manually with matching configurations.
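If you pre-create topics, match the source topic's partition count and any relevant overrides. A sketch, with hypothetical names and values (note the primary. prefix added by MM2's default replication policy):

```bash
# Create the mirrored topic on the target with settings that match the source
bin/kafka-topics.sh --bootstrap-server backup-broker1:9092 --create \
  --topic primary.orders --partitions 12 --replication-factor 3 \
  --config cleanup.policy=delete --config retention.ms=604800000
```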
An imbalanced workload (where some partitions do all the work) can reduce throughput and increase latency.
Common symptom: Some partitions overloaded while others remain idle
Solutions:
Deploy multiple MirrorMaker instances.
Enable round-robin partition assignment.
In a mirrored setup, consumer offsets must be properly mapped to avoid reprocessing or message loss, which is especially important if you're failing over between clusters.
Common symptom: Consumers replaying old data
Solution: Use MirrorMaker 2’s offset sync feature to track and translate offsets between clusters.
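In a dedicated MM2 deployment, offset syncing is driven by the checkpoint connector. A sketch of the relevant connect-mirror-maker.properties settings, assuming primary and backup aliases (sync.group.offsets requires Kafka 2.7 or later):

```properties
# Emit offset checkpoints from source to target
primary->backup.emit.checkpoints.enabled = true
primary->backup.emit.checkpoints.interval.seconds = 60

# Periodically write translated offsets into the target cluster's
# consumer offsets, so groups can fail over without manual translation
primary->backup.sync.group.offsets.enabled = true
primary->backup.sync.group.offsets.interval.seconds = 60
```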
As data flows into multiple clusters, storage requirements grow fast. Without retention policies, your brokers may run out of disk space, slowing down replication or even crashing.
Common symptom: Kafka brokers getting full, slow replication, or errors in log segments
Recommendations:
Apply appropriate log retention policies (log.retention.hours, log.segment.bytes) to control storage usage; an example follows this list.
Explore tiered storage or cloud storage extensions if available in your setup.
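Broker-wide defaults live in server.properties, and individual topics can override them with kafka-configs.sh (shown later in this post). The values here are illustrative:

```properties
# server.properties: broker-wide retention defaults
log.retention.hours = 168        # keep data for 7 days
log.segment.bytes = 1073741824   # roll log segments at 1 GiB
```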
Kafka MirrorMaker can be a rock-solid solution for cross-cluster data replication if configured and monitored properly. By following these best practices, you will avoid the most common pitfalls and build a more reliable, scalable, and self-healing Kafka architecture.
Looking for a Kafka replication solution that minimizes your operational burden and maximizes your scalability and cost savings? Watch this on-demand webinar to learn how Confluent Cloud helps lower costs of self-managed and hosted Kafka services by up to 70%.
After setting up a cross-region Kafka replication pipeline with MirrorMaker 2, the next critical step in designing a resilient Kafka architecture is to leverage cloud availability zones (AZs). Cloud providers like AWS, GCP, and Azure offer multiple AZs within a single region, each an isolated data center with independent power and networking. These can be powerful tools for minimizing downtime and data loss.
Within a single cloud region, availability zones help mitigate the impact of hardware or zone-level failures. To achieve high availability in Kafka using AZs:
Run your Kafka brokers across at least three AZs. This geographic distribution ensures that even if one AZ goes down due to a power outage or a hardware issue, Kafka remains operational with the remaining brokers.
Kafka's architecture is designed for such resilience: partition replicas are automatically redistributed, and leadership elections ensure continuity without manual intervention.
A replication factor of 3 is recommended in a 3-AZ setup. This ensures that each Kafka partition has two additional replicas hosted in different AZs.
So even in the worst-case scenario of a complete AZ failure, no data is lost, and a leader replica is still available to serve consumers and producers. This setup avoids both data unavailability and reprocessing errors.
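For instance, a topic created with a replication factor of 3 and min.insync.replicas=2 can lose an entire AZ and still accept acks=all writes (broker address and topic name are placeholders):

```bash
bin/kafka-topics.sh --bootstrap-server broker1:9092 --create \
  --topic orders --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2
```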
Kafka supports rack awareness using the broker.rack configuration. You can assign each broker to a logical "rack" (in this case, map each AZ to a rack). Kafka’s internal replica placement strategy then ensures no two replicas of the same partition reside in the same AZ.
This intelligent replica placement is critical for avoiding single-zone bottlenecks or correlated failures. It also helps ensure that replication traffic remains efficient across zones while maintaining fault tolerance.
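Mapping AZs to racks is a one-line, per-broker setting in server.properties; for example, on a broker that happens to run in us-east-1a:

```properties
# server.properties for one broker in AZ us-east-1a
broker.id = 1
broker.rack = us-east-1a
```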
While availability zones help you survive zone-level disruptions, they don’t protect against region-wide outages (e.g., due to natural disasters or large-scale cloud issues). This is where multi-region Kafka architecture and DR planning come into play.
Set up an active-passive or active-active Kafka deployment across two (or more) cloud regions. One region functions as the primary, and the other serves as the DR or secondary site.
This secondary cluster can remain on standby or serve lighter workloads, depending on your tolerance for latency and consistency.
As discussed earlier, MirrorMaker 2 can replicate Kafka topics and consumer group offsets from the primary region to the secondary region in near real-time. MirrorMaker 2 supports topic renaming and offset syncing, enabling a seamless switchover when needed.
This asynchronous replication avoids write-latency issues across geographically distant regions while maintaining a recoverable backup.
In the event of a regional failure, you can promote the DR cluster to primary. Since MirrorMaker 2 keeps both messages and offsets in sync, consumers can resume processing close to the correct point, keeping data loss and duplicates to a minimum.
However, successful failover depends on infrastructure readiness:
DNS rerouting or service discovery updates
Application-level retry logic
Regular DR drills to validate readiness
Combining availability zones for HA and regions for DR offers the best of both worlds:
Fault Isolation: Issues in one AZ or region don’t bring down your entire Kafka platform.
Seamless User Experience: End-users and services continue consuming/producing data without interruption.
Cost Efficiency: With strategic planning, secondary regions and zone-level redundancy can be provisioned to balance resilience and cost.
When planning disaster recovery for Kafka, it’s not enough to only replicate the messages. Kafka messages are often serialized using formats like Avro, Protobuf, or JSON Schema, and those formats depend on the Schema Registry to provide the exact schema definition.
If a DR cluster receives data but doesn’t have the correct schema, consumers won’t be able to deserialize the messages. This makes schema replication just as important as data replication.
Below are the main strategies to ensure schemas and registry metadata are available in your DR setup.
Confluent provides Schema Linking, which continuously and automatically replicates schema subjects, versions, and compatibility settings between registries in different regions. This ensures that the DR registry is always in sync with the primary. It works seamlessly alongside Confluent’s Cluster Linking for Kafka topics, providing a complete replication solution.
Confluent Replicator can operate in a special schema translation mode that copies everything stored in the Schema Registry (schema subjects, their versions, and the associated configuration settings) from the primary environment to a DR environment.
This replication can be set to run continuously (real-time updates) or on a scheduled basis (periodic syncs), ensuring the DR environment is always ready if a failover is required.
How it’s done:
One-Time Migration
Used when setting up the DR registry for the first time.
Copies all existing schema subjects, versions, and settings from the primary registry into a fresh, empty DR registry.
Provides a baseline copy to build from.
Continuous Synchronization
Keeps both registries aligned in real time after the initial migration.
Automatically replicates any new schemas or updates made in the primary registry to the DR registry.
Ensures that a switchover can happen instantly, without manual intervention.
Another way to keep your DR Schema Registry updated is by manually exporting and importing schemas using the Schema Registry’s REST API. The API lets you list all registered subjects and their versions from the primary registry and then push them into the DR registry.
In practice, this means calling endpoints like /subjects and /versions to retrieve the full set of schemas, and then using POST requests to load them into the DR environment. You can wrap these calls in a small script and schedule it to run.
This approach is simple and doesn’t require additional tools, making it a reasonable choice for smaller environments or where schema changes are infrequent. However, because it’s not real-time, any schema changes made between syncs won’t be available in the DR registry. If a disaster occurs just before the next scheduled export, you could lose the most recent schema updates.
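A minimal sketch of such a sync script, assuming the standard Schema Registry REST endpoints, hypothetical registry URLs, and Avro schemas (Protobuf and JSON Schema would also need the schemaType and references fields carried over):

```bash
#!/usr/bin/env bash
# Hypothetical registry URLs; a real deployment also needs auth and error handling
PRIMARY="https://schema-registry.primary.example.com"
DR="https://schema-registry.dr.example.com"

# Walk every subject and version in the primary registry...
for subject in $(curl -s "$PRIMARY/subjects" | jq -r '.[]'); do
  for version in $(curl -s "$PRIMARY/subjects/$subject/versions" | jq -r '.[]'); do
    # ...fetch the schema and register it in the DR registry
    schema=$(curl -s "$PRIMARY/subjects/$subject/versions/$version" | jq '{schema: .schema}')
    curl -s -X POST \
      -H "Content-Type: application/vnd.schemaregistry.v1+json" \
      --data "$schema" "$DR/subjects/$subject/versions" > /dev/null
  done
done
```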
MirrorMaker 2 is very good at replicating Kafka topics across clusters, but it has one big gap: it doesn't handle schemas. That means if you rely solely on MirrorMaker 2 for disaster recovery, your DR cluster might have the messages but not the matching schemas to read them properly.
To fix this, you can pair MirrorMaker 2 with a small custom script or service that periodically queries the primary Schema Registry for all subjects and versions, and then registers them in the DR registry. This process runs alongside your topic replication, ensuring that both data and schemas stay in sync.
This approach works well for teams already invested in MirrorMaker 2 and comfortable maintaining a bit of custom code. It’s less polished than an out-of-the-box replication tool, but it gives you the flexibility to control how often schemas are synced and how errors are handled.
In this approach, schema definitions are stored in version control (such as Git) and deployed to both the primary and DR registries using automation tools like Terraform or Ansible. By keeping schemas under version control, every change is documented, accidental mismatches between clusters are avoided, and rolling back to a previous version is straightforward.
This method works best when schema changes are deliberate and follow a defined deployment process. If schemas are created or updated directly in production without going through version control, a separate synchronization process will still be needed to keep the DR registry aligned.
Always replicate schemas alongside Kafka topics to prevent deserialization failures during failover.
Replicate schema compatibility settings in addition to the schema definitions.
Include access controls and authentication settings in the DR replication process.
Periodically test failover scenarios to confirm that consumers function correctly in the DR environment.
While availability zones provide high availability within a single cloud region, true fault tolerance demands preparing for region-wide outages. This is where geo-redundancy steps in, ensuring that Kafka remains operational even if an entire cloud region or data center goes offline due to a natural disaster, power failure, or widespread network issue.
Geo-redundancy involves deploying Kafka clusters in multiple geographical regions and setting up asynchronous replication between them. The goal is to protect data and ensure service continuity during large-scale failures by replicating Kafka topics, configurations, and consumer offsets across regions.
Designing Kafka to be geo-redundant means setting it up in a way that if one entire cloud region goes down, your data and services can continue to operate from another region. This is essential for high availability, business continuity, and disaster recovery. Below are the most important considerations when setting this up:
Kafka does not support cross-region replication out of the box, so we need external tools to replicate data and consumer state between regions.
Available Tools:
Apache MirrorMaker 2 (MM2): A popular open-source tool that can mirror Kafka topics and consumer offsets in near real-time. It’s flexible and widely used in open-source Kafka environments.
Confluent Cluster Linking: A more streamlined replication capability that lets you replicate topics natively between clusters without needing connectors. It's easier to set up and manage than MM2 but requires Confluent Platform or Confluent Cloud.
These tools ensure that your critical topics and consumer offsets are mirrored to a disaster recovery (DR) region so that in case of a regional failure, the backup region is up-to-date and ready to take over.
When deciding how to operate Kafka across multiple regions, there are two main patterns:
Only one Kafka cluster is actively handling all the traffic.
The second cluster stays on standby, ready to take over if the primary fails.
In the event of an outage, you promote the standby cluster to active.
Pros: Easier to manage, avoids conflicts in data or topic structures.
Cons: Slight delay in switching, some downtime during failover.
Both Kafka clusters are actively handling traffic at the same time, usually split by geography or business logic.
This model is ideal for global applications that require low-latency data access from multiple continents.
However, it brings complexity:
Topic name conflicts can happen if both regions write to the same topic.
You need clear strategies for data reconciliation and partitioning.
Pros: Better performance and availability.
Cons: Much more complex to manage.
Read the white paper, “Best Practices for Multi-Region Apache Kafka® Disaster Recovery in the Cloud (Active/Passive)”, to learn more.
It's not enough to just replicate topic data, as your consumer offsets (the position of each consumer in the topic) also need to be synced. Without this, consumers in the backup region wouldn’t know where to resume reading.
Best practices:
Use MM2’s offset sync feature to ensure the backup region knows the latest committed offsets.
During failover:
Reconfigure consumers or update DNS/service discovery so they start reading from the DR cluster.
Promote mirrored topics to become writeable.
Ensure failover is smooth and doesn’t cause message duplication or data loss.
Keep a runbook or playbook that documents every step to perform during a failover.
When you replicate Kafka data across distant cloud regions, latency and network bandwidth become critical concerns.
Best practices:
Use high-speed interconnects like:
AWS Direct Connect
Azure Global VNet Peering
Monitor replication lag carefully; even a few seconds of lag can matter for real-time applications.
Not all topics are equally important: replicate high-priority topics more frequently or monitor them more tightly.
Geo-redundant Kafka deployments must remain secure and compliant with regulations across countries.
Best practices:
Mirror all security configurations such as:
Access Control Lists (ACLs)
User credentials
Role-based access controls (RBAC)
Ensure TLS encryption is enabled for all communication between regions.
Replication tools must also be authenticated to prevent misuse.
Be aware of data residency laws (e.g., GDPR) and ensure cross-border replication is legally permitted.
Running Kafka in multiple regions means more infrastructure to manage, so automation becomes critical.
Best practices:
Use Terraform, Pulumi, or other IaC tools to deploy Kafka clusters consistently across regions.
Set up automated health checks to monitor broker availability, replication lag, disk usage, etc.
Create alerts for:
Replication failures
Broker crashes
Lag spikes
Automate failover tasks where possible to reduce downtime and human error.
Implementing geo-redundancy ensures that your systems can survive major regional outages and continue to deliver real-time data. Here’s what you gain:
Resilience: If an entire cloud region goes down, your DR region can keep things running.
Minimized Data Loss: Near real-time replication keeps the secondary cluster almost in sync.
Better Compliance: Data durability and backup satisfy requirements in regulated industries like finance and healthcare.
Faster Recovery: With automation and clear failover plans, you can get back online in minutes, not hours.
When you replicate Kafka data between clusters, whether for disaster recovery, geo-redundancy, migration, or multi-cluster management, it's not enough to just copy the messages. You also need to make sure topic configurations match.
These configurations include things like:
Partition count
Replication factor
Retention settings
Custom overrides (e.g., cleanup.policy, min.insync.replicas)
Keeping them in sync ensures that your applications behave the same way no matter which cluster they connect to. Here’s how you can do it.
MirrorMaker 2 (MM2) – Apache Kafka
Purpose: Replicates both topic data and (optionally) topic configurations across clusters.
Important: Config sync is off by default. You need to explicitly set: sync.topic.configs.enabled=true
Even with this enabled, not all settings will be copied, especially broker-specific or read-only configs.
Changes in partition count can be mirrored if MM2 is configured correctly, but in some cases, you may need to restart replication or trigger a manual resync.
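In a dedicated MM2 deployment, the relevant connect-mirror-maker.properties settings look like this (aliases and intervals are illustrative):

```properties
# Propagate topic config changes from source to target
primary->backup.sync.topic.configs.enabled = true
primary->backup.sync.topic.configs.interval.seconds = 600

# Periodically pick up new topics and partitions on the source
primary->backup.refresh.topics.enabled = true
primary->backup.refresh.topics.interval.seconds = 600
```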
Cluster Linking – Confluent Platform / Confluent Cloud
Creates a direct link between a source topic and a target topic.
Streams both data and most configurations automatically while the link is active.
Automatically applies config changes from the source, which helps prevent drift.
Caveat: Some configs, especially broker-level overrides (like min.insync.replicas), may still differ if the target cluster has different defaults.
Confluent Replicator
Similar to MM2 but designed for Confluent-specific deployments with more fine-grained control.
Can copy topic configurations if you enable: topic.config.sync=true
Just like MM2, some configs (especially broker-specific or security-related ones) can’t be replicated.
If you’re not using an automated replication tool, the Kafka CLI can help:
kafka-topics.sh – View topic configurations before recreating them in another cluster: kafka-topics.sh --bootstrap-server <broker> --describe --topic <topic-name>
kafka-configs.sh – Export and apply specific configurations: kafka-configs.sh --bootstrap-server <broker> --entity-type topics --entity-name <topic-name> --describe
kafka-configs.sh --bootstrap-server <broker> --entity-type topics --entity-name <topic-name> --alter --add-config retention.ms=86400000
These are great for inspection, drift detection, or scripting, but not for continuous real-time sync.
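For instance, a small drift check can diff the dynamic configs of the same topic on both clusters. A sketch, assuming hypothetical broker addresses and MM2's default policy of prefixing mirrored topics with the source alias:

```bash
#!/usr/bin/env bash
# Usage: ./check-drift.sh <topic-name>
TOPIC="$1"

# Dump the topic's dynamic configs from both clusters, sorted for a stable diff
bin/kafka-configs.sh --bootstrap-server primary-broker1:9092 \
  --entity-type topics --entity-name "$TOPIC" --describe | sort > /tmp/primary.cfg
bin/kafka-configs.sh --bootstrap-server backup-broker1:9092 \
  --entity-type topics --entity-name "primary.$TOPIC" --describe | sort > /tmp/backup.cfg

# Report any drift between the two clusters
diff /tmp/primary.cfg /tmp/backup.cfg && echo "No config drift detected for $TOPIC"
```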
Use Infrastructure-as-Code (IaC): Store your topic definitions (including configs) in a declarative format like Terraform, Ansible, or Kubernetes CRDs. Treat them like version-controlled code for reproducible environments.
Custom Automation Scripts: Use CLI tools to export topic configs and automatically replay them into the target cluster. This is useful if your clusters have different limits and need custom mapping.
Run Periodic Drift Checks: Even with automated tools (MM2, Replicator, Cluster Linking), periodically compare configs between clusters to ensure they still match.
Set Proper Permissions: Replication tools need ACLs to create, alter, and describe topics in both clusters.
MirrorMaker (especially MirrorMaker 2, or MM2) is a native Kafka tool designed to replicate topics across clusters, making it a popular choice for disaster recovery and multi-region replication. While MM2 improves significantly over MM1, it still has some key limitations when it comes to building a truly resilient disaster recovery setup:
Asynchronous replication: MirrorMaker performs asynchronous replication, which means there’s always a delay between source and destination clusters. In the event of a failure, some unreplicated data could be lost. So, MirrorMaker cannot achieve RTO (Recovery Time Objective) = 0 or RPO (Recovery Point Objective) = 0.
Complex failback: Failing over to a secondary Kafka cluster is manageable, but returning operations back to the original (once it's back online) is complex. You'll need to carefully resync topics, reconcile offsets, and avoid reprocessing or data gaps.
Offset translation is imperfect: Although MM2 tries to map consumer offsets between clusters, it’s not foolproof. After a failover, consumers may skip messages or process some messages again, depending on sync timing.
Record ordering guarantees may be compromised: During failover or under high throughput, message order may not be preserved between clusters; this is acceptable for some use cases but problematic for strict-ordering requirements.
When MirrorMaker's trade-offs become blockers, especially for large-scale environments or strict disaster recovery needs, it helps to explore other tools in the ecosystem. Here's a comparison of commonly used Kafka replication or integration tools:
Confluent Replicator is a commercial tool offered by Confluent (the company behind Kafka). It’s built on top of Kafka Connect but adds enterprise-level features.
It supports advanced features like schema evolution, data filtering, monitoring, and error handling.
Since it’s managed, it reduces the operational burden on your teams.
It's not open source, so you'll need a Confluent license.
Ideal for enterprises that want reliable, scalable replication with built-in monitoring and enterprise support.
Cluster Linking is a feature from Confluent that allows you to link Kafka clusters across regions without the need to duplicate brokers or run MirrorMaker.
It supports near real-time replication and works at the topic level.
It avoids some of the downsides of MirrorMaker, like offset mismatch or ordering issues.
Like Confluent Replicator, this is only available with Confluent’s platform.
Ideal for real-time cross-cluster syncing without the need for mirror brokers or complex consumer offset translation.
Kafka Connect is part of Apache Kafka itself and is designed for connecting Kafka to external systems (like databases, cloud storage, or data warehouses). It can also be used to move data between Kafka clusters.
It uses a pluggable architecture with a wide range of connectors, including ones for replicating topics.
Good if you’re already using Kafka for broader data integration tasks or want to manage data pipelines in a more modular way.
Ideal for ETL workflows, integrating with external systems, and when you need control over how data is handled during replication.
Apache Flink® is a powerful stream processing framework that can be used not only to process data in real-time but also to replicate it across systems and regions.
It allows you to apply transformations while replicating, which is useful when different regions need slightly different data.
It requires a separate deployment and has a steeper learning curve than MirrorMaker or Kafka Connect.
Ideal for teams that already use Flink for stream processing and want to combine it with data replication for more flexibility.
Apache NiFi™ is a low-code, drag-and-drop tool for building data flows, including Kafka-to-Kafka replication.
It has a user-friendly interface and doesn’t require much programming knowledge.
It’s not the most efficient option for high-throughput or low-latency use cases, but works well for moderate data loads.
Also supports transformation and routing.
Ideal for organizations looking for a GUI-based, low-code way to replicate data and manage Kafka workflows.
Some teams build their own replication logic using Kafka consumers (to read from a source cluster) and producers (to write to a destination cluster).
This gives you maximum control: you can transform, filter, and route messages however you want.
But it also means you’re responsible for managing offsets, failures, retries, and data integrity.
This is time-intensive and can be error-prone if not carefully designed.
Ideal for niche or highly customized use cases that can’t be met by off-the-shelf tools, or when you want to tightly control data routing logic.
The strongest approach to cross-data-center replication builds in layers: starting with regional high availability, adding multi-region disaster recovery, and reinforcing it with a clear geo-redundancy plan.
It’s about ensuring your data stays available, consistent, and safe, even when parts of the system fail. If you’ve already worked through high availability within a single region, set up disaster recovery across regions, and planned for geo-redundancy, you’re ahead of most teams.
When you combine these with solid schema and topic replication, account for network and operational realities, and stay mindful of your architecture’s limits, you create a system that can adapt and recover quickly. The real test comes during an outage and if your replication strategy is well designed, those moments will pass without disrupting your business or your users.
Ready to put these strategies to the test for your self-managed or managed Kafka workloads? Get started for free with Confluent, wherever your data lives.
Start from business SLOs: if you need near-zero downtime and can safely partition writes by region or entity (to avoid conflicts), Active-Active fits, at higher cost and complexity. If write conflicts or compliance make dual-write risky, choose Active-Passive with a rehearsed promotion runbook and strict lag SLOs.
Track a replication lag SLO on your critical topics and alert on breaches. Pair it with offset-sync freshness and a weekly drift report (topic configs + schema subjects). If any of these trend up, you’re eroding RPO or failover confidence.
Standardize on one offset-sync mechanism (e.g., MM2’s offset translation or your platform’s native feature), test it in drills, and add a post-promotion validation step: consumers start at expected offsets, duplicate/process-gap rate within agreed error budget.
A failover runbook should cover: owners and contacts, preflight checks (health, lag below SLO, schema/config parity), promotion steps (DNS/service discovery updates, topic promotion rules), a validation checklist (produce/consume probes, zero under-replicated partitions, consumer lag trend), rollback steps, and a timestamped evidence bundle.
Apache®, Apache Kafka®, Kafka®, Apache NiFi™, and NiFi™ are either registered trademarks or trademarks of the Apache Software Foundation (ASF). No endorsement by the Apache Software Foundation is implied by the use of these marks.