Implementing Change Data Capture in MS Databases

Seralahthan
10 min readNov 26, 2018

Overview of Change Data

In Data warehouses environment capturing and preserving the state of data across time is one of the core functionality. In the Data warehouses, a timely snapshot of database tables is saved.

In order to do that we need to track down the changes done to the source tables. There are two ways in SQL server that we can use to track down the changes to the source tables.

  • Change Data Tracking (CT)
  • Change Data Capture (CDC)

Change Data Capture (CDC) and Change Tracking (CT) were both introduced in SQL Server 2008 for data tracking. While CDC was only for Enterprise Edition, CT was available for all editions of SQL Server 2008.

SQL Server Change Tracking

SQL Server Change Tracking is the simplest tool for capturing changes to the source tables. CT allows the net changes made to data to be easily returned from a query. This only lets you know that a particular row has changed since your last query.

Change Tracking only provides the following facts

  • Has a row changed or not
  • Latest data of the row can be obtained directly from the table that is being tracked.

It doesn’t provide information about

  • How many times it’s changed
  • The various change values over time if there are intermediate change values they get lost
  • There are possibilities of data getting changed while the time we obtain the latest data from the tracked table. Ideally, the fact that a row has changed and information about the change must be available and recorded at the time that the change was made in the same transaction.

The queries you run will return a table that lets you know which rows have changed since the last check, and then let you know the type of DML change. You need to join this table with the source table to actually get the data.

This is really useful for those applications that cache data and periodically query to update their caches.

Change Data Capture (CDC)

SQL server CDC mechanism captures and records the insert, update, and delete operations performed on the source SQL server table. The details of the changes available in the change tables in an easily consumed relational format.

Metadata and column information required to apply the changes to a target environment is also captured for the modified rows of the source table and stored in the change tables. Table-valued functions are provided to allow systematic access to the change data by consumers.

A good example of a data consumer used by this technology is an extraction, transformation, and loading (ETL) application. ETL application incrementally loads the changes to data from SQL Server source tables to a data warehouse or data mart. Even though we have a requirement of reflecting the changes of the source tables in the data warehouse, it not good to have the replica of source table in the marts. To can be achieved by the SQL Server change data capture technology.

The following illustration shows the principal data flow for the change data capture

Figure 1.0, Source:
https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-2017

When a data manipulation operations like insert, update or delete is applied to the tracked source tables, the relevant entries that describes those changes are added to the log. These transactional logs serve as the input to the capture process. SQL Server’s capture process reads the logs and adds information about changes to the tracked table’s to the associated change table.

SQL Server Edition with Change Data Capture

I have installed the MS SQL Server 2017 Developer Edition on Ubuntu 18.04 environment. Make sure to select the correct edition of SQL Server which supports Change Data Capture feature.

SQL Server’s “Web” and “Express” editions don’t support Change Data Capture. “Enterprise”, “Standard” and “Developer” Editions support Change Data Capture.

Enabling the Change Data Capture for SQL Database

In order to use Change Data Capture functionality provided by the SQL Server, it should be enabled for the particular database which we are planning to track the changes.

To create capture instance for individual tables, a user with “sysadmin” privilege server role must enable the database for change data capture. This is done by running the stored procedure “sys.sp_cdc_enable_db” (Transact-SQL) in the database.

I have a system admin (sysadmin) user “SA” with password “abcd@1234”
Log in as the sysadmin and create a database “TestCDC”.

sqlcmd -S localhost -U SA -P 'abcd@1234'
CREATE DATABASE TestCDC
GO

Now let’s enable the change data capture for the “TestCDC” database by executing the “sys.sp_cdc_enable_db” stored procedure.

USE TestCDC
GO
EXEC sys.sp_cdc_enable_db
GO

Upon the successful execution of the stored procedure, the change data capture will be enabled on the “TestCDC” database. We can verify it by browsing the database entries.

Let’s connect to the database using “DBeaver Client”

Figure 1.1
Figure 1.2
Figure 1.3
Figure 1.4

Now we have connected to the TestCDC database and we have “cdc”, “dbo” and “sys” domains holding separate tables.

Figure 1.5

To verify that the change data capture is enabled execute the following command select query on the “sys.databases” with database name “TestCDC”.

SELECT is_cdc_enabled, * FROM sys.databases WHERE name='TestCDC'
Figure 1.6

The query returns “is_cdc_enabled” column with “1”, which means the change data capture is enabled for the database.

If the change data capture is not enabled it will return the “is_cdc_enabled” column with “0”.

Enabling the change data capture in the “TestCDC” database will create the following tables under the “cdc” and “dbo”.

Figure 1.7

cdc.captured_columns — Stores metadata for tracked table’s columns
cdc.change_tables — Stores metadata for tracked tables
cdc.ddl_history — Stores history of all the DDL changes since capture data enabled.
cdc.index_columns — Stores indexes associated with change table.
cdc.lsn_time_mapping — This table maps LSN number and time. Used for calculating clean-up time

Figure 1.8

dbo.systranschemas — Tracks schema changes.

If you try to execute the “sys.sp_cdc_enable_db” stored procedure on a already change data capture enabled database. The following error message will be thrown.

Figure 1.9

Enabling the Change Data Capture for SQL Database Tables

Now that we have enabled Change Data Capture to the database, let’s create a table “Employee” and insert some data into the table.

Create an “Employee” table with auto increment “EmployeeID” as the Primary Key. SQL Server uses “IDENTITY” keyword to specify auto incrementing.

USE TestCDC
GO
CREATE TABLE dbo.Employee (
EmployeeID INT IDENTITY PRIMARY KEY,
FirstName VARCHAR(100),
LastName VARCHAR(100),
Position VARCHAR(100),
PayScale DECIMAL
)
GO

Insert some values into the “Employee” table.

INSERT INTO dbo.Employee(FirstName, LastName, Position, PayScale)
VALUES
('User1', 'Test1', 'Software Engineer', 150000),
('User2', 'Test2', 'Quality Assuarance Engineer', 120000),
('USer3', 'Test3', 'Business Analyst', 250000),
('USer4', 'Test4', 'Systems Engineer', 150000),
('USer5', 'Test5', 'Project Manager', 200000)
GO

Enable Change Data Capture for a specified table in the current database

To track the changes to data when executing the DML operations we need to enable Change Data Capture to the relevant tables.

To enable CDC for the required table execute the “sys.sp_cdc_enable_table” system stored procedure. When CDC is enabled for a table, a CDC table (table which keeps track of history of changes along with meta-data about changes) and two table-valued functions are generated along with capture and cleanup jobs for the database if this is the first table in the database to be enabled for CDC.

Verify if the table is enabled for CDC by looking into the “is_tracked_by_cdc” column of the sys.tables catalog view.

Let’s execute the “sys.sp_cdc_enable_table” stored procedure to enable change data capture on the “Employee” table.

EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Employee',
@role_name = N'UserRole',
@supports_net_changes = 1
GO

Let’s see what are these parameters mean.

@source_schema — Is the name of the schema in which the source table belongs.

@source_name — Name of the source table.

@role_name — Database role used to gate access to changed data. If the role already exists it is used or else an attempt is made to create a database role with this name.

@supports_net_changes — This supports for querying for net changes. This means all changes that happen on a record will be summarized in the form of net change. By default its value is 1 if the table has a primary key or the table has a unique index that has been specified with @index_name parameter otherwise, the default value is 0.

Let’s check whether the “Employee” table is being successfully tracked by the Change Data Capture process. This can be checked by confirmed by the “is_tracked_by_cdc” parameter of the “Employee” table. “is_tracked_by_cdc” with “1” indicates the table is being successfully tracked.

Figure 2.0

If you execute the “sys.sp_cdc_enable_table” stored procedure on an already tracked table. The following error will be thrown.

Figure 2.1

Now let’s execute some DML statements against the CDC enabled “Employee” table.

INSERT INTO dbo.Employee(FirstName, LastName, Position, PayScale)
VALUES('user6', 'Test6', 'Senior Software Engineer', 235000)
GO
DELETE FROM dbo.Employee
WHERE EmployeeID = 2
GO
UPDATE dbo.Employee
SET PayScale = 275000, FirstName = 'MUser3'
WHERE EmployeeID = 3
GO
UPDATE dbo.Employee
SET PayScale = 300000
WHERE EmployeeID = 3
GO

When we enable CDC for a table, SQL Server creates the table
“cdc.<capture instance>_CT” and keeps recording DML changes happening to the tracked table in this table.

In our example, the result-set of changes captured for the “Employee” table is stored in the “dbo_Employee_CT” table.

Figure 2.2

Let’s view the result-set of changes by executing a SELECT query on this table.
SELECT * FROM [cdc].[dbo_Employee_CT]

Figure 2.3

In the change table there some additional columns like _$operation, _$update_mask which capture meta information about the changes.

“_$operation” — Captures the DML operation that need to be applied to the row of change data to be transformed to the target data source. “1= delete, 2 = insert, 3 = value before update and 4 = value after update”.

“ _$update_mask”“Bit” mask representing columns that were changed during the DML operations. It means “delete (__$operation = 1) and insert (__$operation = 2) operation will have value set to 1 for all defined bits” whereas for “update (__$operation = 3 and __$operation = 4) only those bits corresponding to columns that changed are set to 1”.

SQL Server’s CDC also provides the following two functions.

  • cdc.fn_cdc_get_all_changes_<capture_instance>
  • cdc.fn_cdc_get_net_changes_<capture_instance>

“cdc.fn_cdc_get_all_changes” function is used for viewing multiple changes during the specified range interval, each change is represented in the results. Function returns one row for each change applied to the CDC tracked table within the specified log sequence number (LSN) range.

“cdc.fn_cdc_get_net_changes” returns one net change row for each source row changed within the specified LSN range. That is when a source row has multiple changes during the specified LSN range, a single row that reflects the final content of the row is returned.

In our example we have done to updates of EmployeeID = 3. Let’s view all the changes using the “cdc.fn_cdc_get_all_changes” function.

ECLARE @MinimumLSN binary(10), @MaximumLSN binary(10)
SET
@MinimumLSN = sys.fn_cdc_get_min_lsn('dbo_Employee')
SET
@MaximumLSN = sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee (
@MinimumLSN, @MaximumLSN, N'all');
GO
Figure 2.4

We have 4 different rows for each DML operations performed. Returns two rows for the update of the EmployeeID=3.

Now let’s execute the “cdc.fn_cdc_get_net_changes” function.

DECLARE @MinimumLSN binary(10), @MaximumLSN binary(10)
SET
@MinimumLSN = sys.fn_cdc_get_min_lsn('dbo_Employee')
SET
@MaximumLSN = sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Employee (
@MinimumLSN, @MaximumLSN, N'all');
GO
Figure 2.5

Here we have only the net or final changes for EmployeeID = 3 in a single record.

The functions above expect starting LSN and ending LSN and returns the changes between these two LSNs. You can use “sys.fn_cdc_get_min_lsn” to get start LSN for the specified capture instance and “sys.fn_cdc_get_max_lsn” to get ending LSN from “cdc.lsn_time_mapping” system table.

In some use cases, you might need to pull data based on a time range instead of LSN range and hence you can use the “sys.fn_cdc_map_time_to_lsn” function to get start LSN from “cdc.lsn_time_mapping” system table for the specified time.

CDC Jobs and Cleanup Process

In a CDC enabled database, there will be two jobs created. The first job captures the information from the SQL Server transaction log as it works in an asynchronous manner whereas the second job cleans up the tracked table. The cleanup process occurs every three days by default (this is configurable and can be changed as per specific needs).

Try this out and feel free to comment if you encounter any issue in running the change data capture task.

References:
1. https://en.wikipedia.org/wiki/Change_data_capture

2. http://www.sqlservercentral.com/blogs/steve_jones/2012/05/10/the-difference-between-change-tracking-and-change-data-capture/

3. https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-2017

4. https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sys-sp-cdc-enable-table-transact-sql?view=sql-server-2017

--

--

Seralahthan

Consultant - Integration & CIAM | ATL@WSO2 | BScEng(Hons) in Computer Engineering | Interested in BigData, ML & AI