Change data capture (CDC) refers to the tracking of all changes in a data source (databases, data warehouses, etc.) so they can be captured in destination systems. In short, CDC allows organizations to achieve data integrity and consistency across all systems and deployment environments. Additionally, it allows organizations to use the right tool for the right job by moving data from legacy databases to purpose-built data platforms, such as document or search databases and data warehouses.
Initially, CDC became popular as an alternative solution to batch data replication for populating data warehouses for Extract Transform Load (ETL) jobs. In recent years, CDC has become the de facto method for migrating to the cloud.
Data is the lifeblood of every organization, so it’s essential to ensure that business decisions are based on conflict-free and real-time data. The challenge is that data is constantly changing, which could cause databases, data lakes, and data warehouses to be out of sync. Additionally, organizations are increasingly migrating to the cloud, which increases the chances of ending up with data silos. This has led to CDC becoming a popular solution to bridge the on-premises and cloud environments and let enterprises either migrate to the cloud at their own pace or continue to operate in a hybrid environment.
See Confluent’s Oracle CDC Premium Connector in action here.
When data is changed (through INSERT, UPDATE, or DELETE) in the source database—which is usually a relational database such as MySQL, Microsoft SQL, Oracle, or PostgreSQL—it needs to be propagated to downstream systems such as a cache, search index, data warehouse, or data lake.
Broadly speaking, there are two flavors of CDC: push and pull.
Either the source database pushes the updates to downstream services and applications, or downstream services and applications poll the source database at fixed intervals to pull the updated data.
Each flavor comes with its own advantages and disadvantages, some of which you can read about below. It’s important to consider all of these aspects in the context of your own use case.
Push: In this approach, the source database does the heavy lifting. It implements logic and processes to capture changes in the database, then sends those updates to target systems so they can take appropriate action. The advantage of this method is that target systems will be updated with the latest data in near real time. The downside is that changed data will be lost if the target systems are unreachable or offline. To remedy that, a messaging system is typically implemented between source and target systems so that changes aren’t lost until they are committed to their final destinations.
Pull: In this method, the source database’s job is lighter than in the push method, since it only logs the data changes in a column on each table. It is the target systems’ responsibility to continuously poll the source database to retrieve the changes and take the right actions on them. Just like in the push method, a messaging system is needed between the source and target systems to ensure that the changed data isn’t lost when the target systems are unavailable. The disadvantage of the pull approach is that target systems won’t be immediately notified if the data changes. Since the changes are batched between the pull requests, there will be a lag before the target systems learn about these changes.
To summarize, if your application needs real-time data, you should use the push method. If you want to continuously transfer big batches of data and your applications don’t need real-time data, you should opt for the pull method.
CDC detects changes in data using different methods. Following are the most commonly used methods:
Timestamp-based: Database designers can add a column to the schema of tables that reflects the time of the most recent change. This column can be called LAST_MODIFIED, LAST_UPDATED, etc. Downstream applications or systems can query this field and get the records that have been updated since the last execution time.
An example is shown of a snapshot of the table when a new record is created and a snapshot of the table when the record with ID=101 is updated.
Trigger-based: Most databases support trigger functions. These are stored procedures that are automatically executed once a specific event—such as to INSERT, UPDATE, or DELETE a record—occurs on a table. One trigger for each operation per table is needed to capture any data changes. These data changes are stored in a separate table (which is commonly referred to as a shadow table or event table) in the same database. Additionally, developers can include messaging systems so that these data changes can be published to queues where relevant target systems subscribe to them.
Log-based: Transactional databases log all changes—INSERT, UPDATE, and DELETE— committed against the database and their corresponding timestamps into files called transaction logs. These logs are primarily used for backup and disaster-recovery purposes, but they can also be used to propagate changes to target systems. Data changes are captured in real time. Since target systems can read from transaction logs, this method doesn’t impose computational overhead on source databases.
There are many use cases for CDC. Let’s review a few examples below.
When copying an entire source database to a destination system in batch mode, the source database cannot accept new writes, including schema changes, until the process is complete. The longer the copying process, the greater the risk of delaying important changes to the source. There is also a greater potential for further delays in passing the changes to the target.
Neither of these scenarios are acceptable in today’s digital world, where consumers demand real-time experiences from modern applications. CDC addresses these challenges by continuously replicating changed data (which is a subset of the entire database) to downstream consumers.
As organizations continue to break down their monolithic architecture and adopt microservices, they need to transfer data from source databases and possibly direct it to more than one destination system. Since these transitions take time, CDC can be used to keep both source and target data stores synchronized during this process.
Organizations are increasingly migrating to the cloud in an effort to reduce TCO and improve agility and elasticity. By leveraging cloud-native services, companies can focus on building new digital experiences rather than spending time and resources configuring, maintaining, and managing their databases and infrastructure.
CDC allows you to capture data changes in source databases, but you still need to communicate and propagate these changes to downstream systems. This is where Apache Kafka and Kafka Connect shine. Apache Kafka is an open-source event streaming platform. It provides the ability to durably write and store streams of events and process them, either in real time or retrospectively. Kafka is a distributed system of servers and clients that provides reliable and scalable performance.
The Kafka Connect API is a core component of Apache Kafka, introduced in version 0.9. Kafka Connect provides scalable and resilient integration for Kafka with other systems, both sending to and receiving data from them. Kafka Connect is configuration-driven, meaning that you don’t need to write any code to use it. It is driven purely by configuration files and therefore provides an easy integration point for developers.
One of the popular use cases for Kafka Connect is database change data capture. You can leverage Confluent’s JDBC or Debezium CDC connectors to integrate Kafka with your database and easily stream your data into Confluent.
JDBC connector polls source databases for new or changed data based on update timestamp column. Confluent Cloud offers fully-managed source connectors for Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
Furthermore, Confluent offers log-based CDC source connectors for Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. These operate by obtaining a snapshot of existing data at the startup time, then monitoring and recording all subsequent row-level changes to that data.
Learn how to build streaming data pipelines with Apache Kafka and Confluent with this free on-demand course.
Confluent’s fully-managed CDC source connectors for MySQL, Oracle, Microsoft SQL Server, and PostgreSQL simplify streaming data pipelines. These connectors help you easily migrate data across different environments, meeting you everywhere your data systems live. Our rich ecosystem of 120+ pre-built source and sink connectors makes it easy to connect multiple data systems and keep them in sync throughout the migration. Furthermore, you can reduce your total cost of ownership using our fully managed, cloud-native services with no operational overhead or prohibitive licensing costs.