Change data capture (CDC) refers to the process of tracking all changes in data sources, such as databases and data warehouses, so they can be captured in destination systems. CDC methods like transaction logs and database triggers allow organizations to achieve data integrity and consistency across multiple systems and deployment environments. Additionally, they allow organizations to use the right tool for the right job by moving data from legacy databases to purpose-built data platforms, such as document or search databases, or data warehouses.
Initially, CDC became popular as an alternative solution to batch data replication for populating data warehouses for Extract Transform Load (ETL) jobs. In recent years, CDC has become the de facto method for migrating to the cloud.
Here are some of the main benefits of CDC:
Data is constantly changing, which can cause databases, data lakes, and data warehouses to be out of sync. Additionally, organizations are increasingly migrating to the cloud, which increases the chances of ending up with data silos. CDC has become a popular solution to bridge on-premises and cloud environments, allowing enterprises to migrate to the cloud at their own pace, or continue to operate in a hybrid environment.
See Confluent’s Oracle CDC Premium Connector in action here.
When data is changed through INSERT, UPDATE, or DELETE in a source database—which is usually a relational database such as MySQL, Microsoft SQL, Oracle, or PostgreSQL—it needs to be propagated to downstream systems such as caches, search indexes, data warehouses, and data lakes. Change data capture (CDC) tracks changes in a source dataset and transfers those changes to a target dataset.
Traditionally, teams used batch processing to sync data, which meant data was not synced right away, production databases were slowed down to allocate resources, and data replication only occurred during specified batch windows. CDC ensures that changes are synced in real-time, which eliminates the delays associated with traditional batch processing. CDC continuously tracks changes and updates target databases immediately, ensuring data is always up to date.
CDC is typically implemented using two main approaches: push and pull: Either the source database pushes the updates to downstream services and applications, or downstream services and applications poll the source database at fixed intervals to pull the updated data. Each approach has its own advantages and disadvantages. It’s important to consider all of these aspects in the context of your own use case.
Push: In this approach, the source database does the heavy lifting. It captures changes in the database and sends those updates to target systems so they can take appropriate action.
The advantage of this method is that target systems are updated with the latest data in near real-time. However, if the target systems are unreachable or offline, the changed data can be lost.
To mitigate these risks, a messaging system is typically implemented between the source and target systems to buffer changes until they can be committed to their final destinations.
Pull: In this method, the source database’s job is lighter than the push method. Instead of actively sending updates, the source database logs data changes in a specific column on each table. It is the target systems’ responsibility to continuously poll the source database to retrieve the changes and take the right actions on them.
Like in the push method, a messaging system is required between the source and target systems, in order to ensure that the changed data isn’t lost when the target systems are unavailable.
The disadvantage of the pull approach is that target systems won’t be immediately notified if the data changes. Since the changes are batched between pull requests, there is a lag before the target systems will learn about these changes.
If your application needs real-time data, you should use the push method: it ensures the immediate propagation of changes. However, it requires robust messaging systems to handle the potential downtime of target systems.
If you want to continuously transfer big batches of data, and your applications don’t need real-time data, you should opt for the pull method. However, there will be a lag due to batch processing.
CDC detects changes in data using different methods. Following are the most commonly used methods:
Timestamp-based: e most recent change. This column can be called LAST_MODIFIED, LAST_UPDATED, etc. Downstream applications or systems can query this field and get the records that have been updated since the last execution time.
Pros:
Cons:
An example is shown of a snapshot of the table when a new record is created and a snapshot of the table when the record with ID=101 is updated.
Trigger-based: Most databases support trigger functions. These are stored procedures that are automatically executed once a specific event—such as to INSERT, UPDATE, or DELETE a record—occurs on a table. One trigger for each operation per table is needed to capture any data changes. These data changes are stored in a separate table (commonly referred to as a “shadow table” or “event table”) in the same database. Additionally, developers can include messaging systems so that these data changes can be published to queues, where relevant target systems subscribe to them.
Pros:
Cons:
Log-based: Transactional databases log all changes—INSERT, UPDATE, and DELETE— committed against the database and their corresponding timestamps into files called transaction logs. These logs are primarily used for backup and disaster-recovery purposes, but they can also be used to propagate changes to target systems. Data changes are captured in real time. Since target systems can read from transaction logs, this method doesn’t impose computational overhead on source databases.
Pros: