Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now

What is Change Data Capture?

Change data capture (CDC) refers to the process of tracking all changes in data sources, such as databases and data warehouses, so they can be captured in destination systems. CDC methods like transaction logs and database triggers allow organizations to achieve data integrity and consistency across multiple systems and deployment environments. Additionally, they allow organizations to use the right tool for the right job by moving data from legacy databases to purpose-built data platforms, such as document or search databases, or data warehouses.

Why Change Data Capture?

Initially, CDC became popular as an alternative solution to batch data replication for populating data warehouses for Extract Transform Load (ETL) jobs. In recent years, CDC has become the de facto method for migrating to the cloud.

Here are some of the main benefits of CDC:

  • Eliminates Bulk Load Updates: CDC enables incremental loading or real-time streaming of data changes into the target repository. There is no need for bulk load updates and inconvenient batch windows.
  • Log-Based Efficiency: Log-based CDC minimizes the impact on source systems by capturing changes directly from transaction logs. This reduces system resource usage and maintains source systems performance.
  • Zero-Downtime Migrations: Real-time data movement facilitated by CDC supports zero-downtime database migrations. It ensures up-to-date data availability for real-time analytics and reporting.
  • Synchronization Across Systems: CDC keeps data in multiple systems synchronized, crucial for time-sensitive decisions in high-velocity data environments.
  • Optimized for Cloud and Stream Processing: CDC efficiently moves data across wide area networks, making it ideal for cloud deployments and integrating data with stream processing solutions like Apache Kafka.
Diagram depicting change data capture

Quick Intro to CDC

Data is constantly changing, which can cause databases, data lakes, and data warehouses to be out of sync. Additionally, organizations are increasingly migrating to the cloud, which increases the chances of ending up with data silos. CDC has become a popular solution to bridge on-premises and cloud environments, allowing enterprises to migrate to the cloud at their own pace, or continue to operate in a hybrid environment.

See Confluent’s Oracle CDC Premium Connector in action here.

How Does Change Data Capture Work?

When data is changed through INSERT, UPDATE, or DELETE in a source database—which is usually a relational database such as MySQL, Microsoft SQL, Oracle, or PostgreSQL—it needs to be propagated to downstream systems such as caches, search indexes, data warehouses, and data lakes. Change data capture (CDC) tracks changes in a source dataset and transfers those changes to a target dataset.

Traditionally, teams used batch processing to sync data, which meant data was not synced right away, production databases were slowed down to allocate resources, and data replication only occurred during specified batch windows. CDC ensures that changes are synced in real-time, which eliminates the delays associated with traditional batch processing. CDC continuously tracks changes and updates target databases immediately, ensuring data is always up to date.

CDC is typically implemented using two main approaches: push and pull: Either the source database pushes the updates to downstream services and applications, or downstream services and applications poll the source database at fixed intervals to pull the updated data. Each approach has its own advantages and disadvantages. It’s important to consider all of these aspects in the context of your own use case.

Push vs. Pull

Push: In this approach, the source database does the heavy lifting. It captures changes in the database and sends those updates to target systems so they can take appropriate action.

The advantage of this method is that target systems are updated with the latest data in near real-time. However, if the target systems are unreachable or offline, the changed data can be lost.

To mitigate these risks, a messaging system is typically implemented between the source and target systems to buffer changes until they can be committed to their final destinations.

Pull: In this method, the source database’s job is lighter than the push method. Instead of actively sending updates, the source database logs data changes in a specific column on each table. It is the target systems’ responsibility to continuously poll the source database to retrieve the changes and take the right actions on them.

Like in the push method, a messaging system is required between the source and target systems, in order to ensure that the changed data isn’t lost when the target systems are unavailable.

The disadvantage of the pull approach is that target systems won’t be immediately notified if the data changes. Since the changes are batched between pull requests, there is a lag before the target systems will learn about these changes.

If your application needs real-time data, you should use the push method: it ensures the immediate propagation of changes. However, it requires robust messaging systems to handle the potential downtime of target systems.

If you want to continuously transfer big batches of data, and your applications don’t need real-time data, you should opt for the pull method. However, there will be a lag due to batch processing.

Change Data Capture Patterns and Methods

CDC detects changes in data using different methods. Following are the most commonly used methods:

Timestamp-based capture patterns

Timestamp-based: e most recent change. This column can be called LAST_MODIFIED, LAST_UPDATED, etc. Downstream applications or systems can query this field and get the records that have been updated since the last execution time.

Pros:

  • Simple to use and implement.
  • Provides a straighforward way to track changes over time.

Cons:

  • Can only process soft deletes and not DELETE operations.
  • Adds computational overhead to the source system, since target systems have to scan each row in the table to identify the last updated values.
  • Requires changes to the existing database schema.

An example is shown of a snapshot of the table when a new record is created and a snapshot of the table when the record with ID=101 is updated.

Trigger-based capture patterns

Trigger-based: Most databases support trigger functions. These are stored procedures that are automatically executed once a specific event—such as to INSERT, UPDATE, or DELETE a record—occurs on a table. One trigger for each operation per table is needed to capture any data changes. These data changes are stored in a separate table (commonly referred to as a “shadow table” or “event table”) in the same database. Additionally, developers can include messaging systems so that these data changes can be published to queues, where relevant target systems subscribe to them.

Pros:

  • Can detect and capture all types of changes (INSERT, UPDATE, and DELETE) to records.
  • Triggers are widely used and are supported by the majority of databases.
  • Enables real-time data capture without polling.

Cons:

  • Negatively impacts source database performance, since updating records requires multiple writes.
  • Requires changes to the source database schema.
  • Can become complex to manage with a large number of triggers.
Log-based capture patterns

Log-based: Transactional databases log all changes—INSERT, UPDATE, and DELETE— committed against the database and their corresponding timestamps into files called transaction logs. These logs are primarily used for backup and disaster-recovery purposes, but they can also be used to propagate changes to target systems. Data changes are captured in real time. Since target systems can read from transaction logs, this method doesn’t impose computational overhead on source databases.

Pros:

  • Doesn’t add computational overhead to source databases.
  • Can detect and capture all types of changes (INSERT, UPDATE, and DELETE) to records.
  • No need to change the schema of the source databases.

Cons:

  • There isn’t standardization around the format of transaction logs. This means that each vendor opts to implement their own methodology, which could change in future releases.
  • Target systems have to identify and eliminate any changes that were written to source databases but then rolled back.

Change Data Capture Use Cases

Change Data Capture (CDC) is used across modern data integration systems to ensure compliance and enable real-time analytics. By continuously capturing and replicating data changes, CDC supports seamless data migration, integration with microservices, and cloud adoption.

There are many use cases for CDC. Let’s review a few examples below.

Continuous Data Replication

When copying an entire source database to a destination system in batch mode, the source database cannot accept new writes, including schema changes, until the process is complete. The longer the copying process, the greater the risk of delaying important changes to the source. There is also a greater potential for further delays in passing the changes to the target.

Neither of these scenarios is acceptable, because consumers demand real-time experiences from modern applications. CDC addresses these challenges by continuously replicating changed data (a subset of the entire database) to downstream consumers.

Integration with Microservices Architecture

As organizations continue to break down their monolithic architectures and adopt microservices, they need to transfer data from source databases, and possibly direct it to more than one destination system. Since these transitions take time, CDC can be used to keep both source and target data stores synchronized during this process.

Cloud Adoption

Organizations are increasingly migrating to the cloud to reduce TCO and improve agility and elasticity. By leveraging cloud-native services, companies can focus on building new digital experiences rather than spending time and resources configuring, maintaining, and managing their databases and infrastructure.

CDC supports this migration and ensures that data remains consistent and up-to-date across on-premises and cloud environments. This seamless data integration helps businesses fully leverage cloud capabilities without disruption.

Change Data Capture and Apache Kafka

CDC allows you to capture data changes in source databases, but you still need to communicate and propagate these changes to downstream systems. This is where Apache Kafka and Kafka Connect shine. Apache Kafka is an open-source event streaming platform that can durably write and store streams of events, and process them in real-time or retrospectively. Kafka is a distributed system of servers and clients that provides reliable and scalable performance.

The Kafka Connect API is a core component of Apache Kafka, introduced in version 0.9. Kafka Connect provides scalable and resilient integration for Kafka with other systems, both sending to and receiving data from them. Kafka Connect is configuration-driven, meaning that you don’t need to write any code to use it. It is driven purely by configuration files and provides an easy integration point for developers.

One of the popular use cases for Kafka Connect is database change data capture. You can leverage Confluent’s JDBC or Debezium CDC connectors to integrate Kafka with your database, and easily stream your data into Confluent.

The JDBC connector polls source databases for new or changed data based on update timestamp columns. Confluent Cloud offers fully-managed source connectors for Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.

Additionally, Confluent offers log-based CDC source connectors for Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. These connectors obtain a snapshot of existing data at startup time, and then monitor and record all subsequent row-level changes.

Learn how to build streaming data pipelines with Apache Kafka and Confluent with this free on-demand course.

Why Confluent?

Confluent’s fully-managed CDC source connectors for MySQL, Oracle, Microsoft SQL Server, and PostgreSQL simplify streaming data pipelines. These connectors help you easily migrate data across different environments, meeting you everywhere your data systems live. Our rich ecosystem of 120+ pre-built source and sink connectors makes it easy to connect multiple data systems, and keep them in sync throughout the migration. Furthermore, you can reduce your total cost of ownership using our fully managed, cloud-native services with no operational overhead or prohibitive licensing costs.

Learn More About CDC

Stream to cloud databases with Confluent

Try the CDC demo from our GitHub repo

Stream from on-premises Oracle to cloud-based MongoDB

Build streaming data pipelines to cloud data warehouses

Streaming data pipelines course by Confluent developers

Learn how Confluent Cloud can help reduce TCO by up to 60%

Streaming data pipelines