국내 No.1 에너지 IT기업 ‘해줌’의 컨플루언트 클라우드 도입 스토리 | 알아보고 등록하기

What is Change Data Capture?

Change data capture (CDC) refers to the process of tracking all changes in data sources, such as databases and data warehouses, so they can be captured in destination systems. CDC methods like transaction logs and database triggers allow organizations to achieve data integrity and consistency across multiple systems and deployment environments. Additionally, they allow organizations to use the right tool for the right job by moving data from legacy databases to purpose-built data platforms, such as document or search databases, or data warehouses.

Why Change Data Capture?

Initially, CDC became popular as an alternative solution to batch data replication for populating data warehouses for Extract Transform Load (ETL) jobs. In recent years, CDC has become the de facto method for migrating to the cloud.

Here are some of the main benefits of CDC:

  • Eliminates Bulk Load Updates: CDC enables incremental loading or real-time streaming of data changes into the target repository. There is no need for bulk load updates and inconvenient batch windows.
  • Log-Based Efficiency: Log-based CDC minimizes the impact on source systems by capturing changes directly from transaction logs. This reduces system resource usage and maintains source systems performance.
  • Zero-Downtime Migrations: Real-time data movement facilitated by CDC supports zero-downtime database migrations. It ensures up-to-date data availability for real-time analytics and reporting.
  • Synchronization Across Systems: CDC keeps data in multiple systems synchronized, crucial for time-sensitive decisions in high-velocity data environments.
  • Optimized for Cloud and Stream Processing: CDC efficiently moves data across wide area networks, making it ideal for cloud deployments and integrating data with stream processing solutions like Apache Kafka.
Diagram depicting change data capture

Quick Intro to CDC

Data is constantly changing, which can cause databases, data lakes, and data warehouses to be out of sync. Additionally, organizations are increasingly migrating to the cloud, which increases the chances of ending up with data silos. CDC has become a popular solution to bridge on-premises and cloud environments, allowing enterprises to migrate to the cloud at their own pace, or continue to operate in a hybrid environment.

See Confluent’s Oracle CDC Premium Connector in action here.

How Does Change Data Capture Work?

When data is changed through INSERT, UPDATE, or DELETE in a source database—which is usually a relational database such as MySQL, Microsoft SQL, Oracle, or PostgreSQL—it needs to be propagated to downstream systems such as caches, search indexes, data warehouses, and data lakes. Change data capture (CDC) tracks changes in a source dataset and transfers those changes to a target dataset.

Traditionally, teams used batch processing to sync data, which meant data was not synced right away, production databases were slowed down to allocate resources, and data replication only occurred during specified batch windows. CDC ensures that changes are synced in real-time, which eliminates the delays associated with traditional batch processing. CDC continuously tracks changes and updates target databases immediately, ensuring data is always up to date.

CDC is typically implemented using two main approaches: push and pull: Either the source database pushes the updates to downstream services and applications, or downstream services and applications poll the source database at fixed intervals to pull the updated data. Each approach has its own advantages and disadvantages. It’s important to consider all of these aspects in the context of your own use case.

Push vs. Pull

Push: In this approach, the source database does the heavy lifting. It captures changes in the database and sends those updates to target systems so they can take appropriate action.

The advantage of this method is that target systems are updated with the latest data in near real-time. However, if the target systems are unreachable or offline, the changed data can be lost.

To mitigate these risks, a messaging system is typically implemented between the source and target systems to buffer changes until they can be committed to their final destinations.

Pull: In this method, the source database’s job is lighter than the push method. Instead of actively sending updates, the source database logs data changes in a specific column on each table. It is the target systems’ responsibility to continuously poll the source database to retrieve the changes and take the right actions on them.

Like in the push method, a messaging system is required between the source and target systems, in order to ensure that the changed data isn’t lost when the target systems are unavailable.

The disadvantage of the pull approach is that target systems won’t be immediately notified if the data changes. Since the changes are batched between pull requests, there is a lag before the target systems will learn about these changes.

If your application needs real-time data, you should use the push method: it ensures the immediate propagation of changes. However, it requires robust messaging systems to handle the potential downtime of target systems.

If you want to continuously transfer big batches of data, and your applications don’t need real-time data, you should opt for the pull method. However, there will be a lag due to batch processing.

Change Data Capture Patterns and Methods

CDC detects changes in data using different methods. Following are the most commonly used methods:

Timestamp-based capture patterns

Timestamp-based: e most recent change. This column can be called LAST_MODIFIED, LAST_UPDATED, etc. Downstream applications or systems can query this field and get the records that have been updated since the last execution time.

Pros:

  • Simple to use and implement.
  • Provides a straighforward way to track changes over time.

Cons:

  • Can only process soft deletes and not DELETE operations.
  • Adds computational overhead to the source system, since target systems have to scan each row in the table to identify the last updated values.
  • Requires changes to the existing database schema.

An example is shown of a snapshot of the table when a new record is created and a snapshot of the table when the record with ID=101 is updated.

Trigger-based capture patterns

Trigger-based: Most databases support trigger functions. These are stored procedures that are automatically executed once a specific event—such as to INSERT, UPDATE, or DELETE a record—occurs on a table. One trigger for each operation per table is needed to capture any data changes. These data changes are stored in a separate table (commonly referred to as a “shadow table” or “event table”) in the same database. Additionally, developers can include messaging systems so that these data changes can be published to queues, where relevant target systems subscribe to them.

Pros:

  • Can detect and capture all types of changes (INSERT, UPDATE, and DELETE) to records.
  • Triggers are widely used and are supported by the majority of databases.
  • Enables real-time data capture without polling.

Cons:

  • Negatively impacts source database performance, since updating records requires multiple writes.
  • Requires changes to the source database schema.
  • Can become complex to manage with a large number of triggers.
Log-based capture patterns

Log-based: Transactional databases log all changes—INSERT, UPDATE, and DELETE— committed against the database and their corresponding timestamps into files called transaction logs. These logs are primarily used for backup and disaster-recovery purposes, but they can also be used to propagate changes to target systems. Data changes are captured in real time. Since target systems can read from transaction logs, this method doesn’t impose computational overhead on source databases.

Pros:

  • Doesn