Cross-platform solutions let organizations build on real-time data, backed by highly available services and low-latency databases hosted on Microsoft Azure.
Azure Cosmos DB is a fully managed NoSQL database offering with financially backed latency and availability SLAs, enabling apps to run at scale. Confluent is extensively used in data pipelines, serving as a distributed platform for data in motion and providing a seamless solution that connects large-scale applications and microservices running on Azure Cosmos DB.
Confluent provides a platform for data in motion that enables organizations to enrich and transform data as it flows from legacy and/or on-prem systems into new systems on Azure. Confluent on Azure allows for more seamless integration into Azure’s broader ecosystem.
Confluent and Microsoft’s Commercial Software Engineering group have worked together to build a self-managed connector. The Azure Cosmos DB Connector provides a new integration capability between Azure Cosmos DB and Confluent Platform. Microsoft provides enterprise support for the Azure Cosmos DB Connector.
As companies adopt the cloud, they may discover that migrating to the cloud is not a simple, one-time project; it's a much harder task than building new cloud-native applications. Keeping the old legacy stack and the new cloud applications in sync, with a single cohesive global information system, is critical.
Confluent enables large-scale data pipelines that automate real-time data movement across systems, applications, and architectures. You can aggregate, transform, and move data from on-premises legacy services, private clouds, or public clouds into your apps through a central data pipeline for insights and analytics.
This demo shows how to populate test data from the Datagen Source Connector to Azure Cosmos DB using Confluent and the newly available self-managed Azure Cosmos DB connector.
There are a number of ways to get started with Confluent Platform. For this tutorial, we are going to use the cp-all-in-one Docker image, which installs all components of the Confluent Platform. You can walk through the Docker quick start by following the documentation. You can also install it on an Azure virtual machine (VM).
Let’s copy down the docker-compose.yml file:
curl --silent --output docker-compose.yml
For more information, see GitHub.
We can start the Docker containers with this:
docker-compose up -d
And very quickly check the status of our containers with this:
docker-compose ps
If you can see that the state for all of your containers is “up”, we are good to go!
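With this many containers it is easy to miss one that has exited. The helper below is a minimal sketch for spotting it, assuming the table layout printed by Docker Compose v1 (two header lines, then one row per container with a State column). The name list_not_up is hypothetical and not part of Docker.

```shell
# list_not_up: read `docker-compose ps` output on stdin and print the name
# of every container whose row does not show the state "Up".
# Assumes the Docker Compose v1 table layout (two header lines first).
list_not_up() {
  awk 'NR > 2 && NF > 0 && $0 !~ / Up( |$)/ { print $1 }'
}

# Intended use (needs the running stack, so it is left commented out):
#   docker-compose ps | list_not_up
```

If the helper prints nothing, every listed container reported "Up". Any name it does print is a container worth inspecting with docker-compose logs.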
While we are here, we can go ahead and install the files needed for the Azure Cosmos DB connector. On the Connect instance, we can run this:
confluent-hub install microsoftcorporation/kafka-connect-cosmos:1.0.4-beta
After installation, the Connect worker needs to be restarted so that it picks up the new plugin. You can use Docker Compose to restart the Connect container.
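For those who prefer the command line, here is a minimal sketch of the restart-and-verify step. It assumes the Connect service is named connect, as in the cp-all-in-one Compose file, and that the Kafka Connect REST API is published on localhost:8083. The helper name has_cosmos_plugin is hypothetical, and the connector class names it looks for are an assumption to check against your installed version.

```shell
# has_cosmos_plugin: read the JSON list returned by the Kafka Connect REST
# endpoint GET /connector-plugins on stdin and succeed if a Cosmos DB
# connector class appears in it. The class-name pattern is an assumption.
has_cosmos_plugin() {
  grep -Eq 'CosmosDB(Sink|Source)Connector'
}

# Intended use (needs the running stack, so it is left commented out):
#   docker-compose restart connect
#   curl --silent http://localhost:8083/connector-plugins | has_cosmos_plugin \
#     && echo "Cosmos DB connector is installed"
```

The Connect worker can take a little while to come back after a restart, so the curl call may need to be retried before it returns the plugin list.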
Now that all of our services are up, we can go back into Docker and open our containers. Open your terminal > Go to your Docker folder > Open up your set of containers. We will be interacting with the following components: Control Center, ksql-datagen, Kafka Connect, the broker, and ZooKeeper. We can hover over the “control-center” container and open up the Control Center UI in the browser.
The browser will open up to the URL http://localhost:9021/clusters/, which will show the Control Center UI. Control Center delivers a UI for Apache Kafka®, allowing developers to manage messages, topics, schemas, and Kafka Connect connectors. We haven’t created any topics or sent any messages yet, so the UI will show zero clusters, brokers, and topics.
Confluent Control Center will now show that the Azure Cosmos DB connector is available.
Now we can set up our source and sinks. But first, let’s set up our topic and make sure that the schema is in line with our Azure Cosmos DB database.
We will go over topic mapping a little later, but for now, it is important that you review the schema of your database and match the data types in your schema to your database.
If we make a mistake with the schema, we need to delete the topic. We can delete a topic by running this:
kafka-topics --delete --zookeeper localhost:2181 --topic <topicname>
You will receive back a message saying that the topic <topicname> is marked for deletion. Note: This will have no impact if delete.topic.enable is not set to true. Of course, use this command at your own risk!
To define the mapping between Kafka topics and Azure Cosmos DB containers, both the sink and source connectors use a topic-container map configuration. This config is a comma-separated string of topic-container pairs, with each pair joined by “#”. For instance, to map the topic mytopic1 to the Cosmos container myctr1 and the topic test-topic to the container test-ctr, you would specify it as shown here:
mytopic1#myctr1,test-topic#test-ctr
There are a few things to note about the topic mapping. Each topic and container name can be used only once in the overall mapping. Since the Azure Cosmos DB sink and source connectors both monitor a single Cosmos database at once, the specified Cosmos containers must be present in this database.
A single Cosmos sink connector task can work with multiple topic-container pairs. A source connector, on the other hand, needs as many tasks as there are topic-container pairs. For the example above, since there are two pairs, you will need to create a source connector with tasks.max set to two.
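The mapping string and the matching task count are easy to get wrong by hand, so here is a small sketch that builds both. The function names build_topic_map and count_pairs are hypothetical helpers, not part of the connector or of Confluent Platform.

```shell
# build_topic_map: join "topic container" argument pairs into the
# "topic#container,topic#container" string described above.
build_topic_map() {
  map=""
  while [ "$#" -ge 2 ]; do
    map="${map:+$map,}${1}#${2}"
    shift 2
  done
  printf '%s\n' "$map"
}

# count_pairs: number of topic-container pairs in a map string, which is
# the tasks.max value a source connector needs under the rule above.
count_pairs() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Example:
#   build_topic_map mytopic1 myctr1 test-topic test-ctr
#   count_pairs "mytopic1#myctr1,test-topic#test-ctr"
```

The sketch does not enforce the rule that each topic and container name appears only once, so that still needs a quick visual check.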
Now that we have standardized the data elements flowing from the Datagen Source Connector to Confluent, and from Confluent to Azure Cosmos DB, let’s go ahead and create our connectors from within Confluent Control Center.
Log in to Confluent. Click on the Connectors link on the left menu. Select the Datagen Source Connector (you can also use the search box). Fill in the details as follows:
Schema.String: <fill in the schema string from your DataGen>
Check out the Kafka Connect Datagen repo on GitHub for additional code samples.
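The same source connector can also be created without the UI by posting a definition to the Kafka Connect REST API. The sketch below is one possible definition, not the exact settings used in this demo: it swaps the custom schema string for the Datagen connector's bundled users quickstart schema to keep the example short, and the connector name, topic, and converter choices are assumptions.

```shell
# datagen_config: print a Datagen source connector definition as JSON.
# $1 = connector name, $2 = target topic (hypothetical example values).
# Uses the bundled "users" quickstart schema in place of a schema.string.
datagen_config() {
  cat <<EOF
{
  "name": "$1",
  "config": {
    "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
    "kafka.topic": "$2",
    "quickstart": "users",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "max.interval": "1000",
    "tasks.max": "1"
  }
}
EOF
}

# Intended use (assumes the Connect REST API on localhost:8083):
#   datagen_config datagen-users mytopic1 | curl --silent -X POST \
#     -H "Content-Type: application/json" --data @- \
#     http://localhost:8083/connectors
```

Writing values as schemaless JSON is a choice made here so that the records can be consumed as plain JSON documents downstream. If your topic schema differs, replace the quickstart entry with your own schema.string.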
Select the CosmosDB Sink Connector (you can also use the search box). Fill in the details as follows:
<insert your Azure Cosmos DB Endpoint here>
<insert your Azure Cosmos DB Endpoint here>
Then hit Continue. Check out the Microsoft CosmosDB repo on GitHub for additional code samples.
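As with the source, the sink can be defined as JSON and posted to the Kafka Connect REST API. Treat the sketch below as a starting point under stated assumptions: the connector name is made up, the connect.cosmos.* property names and the connector class follow our reading of the connector's README and should be checked against the version you installed, and the converters must match how the topic was written.

```shell
# cosmos_sink_config: print a Cosmos DB sink connector definition as JSON.
# $1 = Cosmos DB endpoint, $2 = account key, $3 = database name,
# $4 = topic-container map, $5 = comma-separated list of topics to read.
# Property names are assumptions; verify them for your connector version.
cosmos_sink_config() {
  cat <<EOF
{
  "name": "cosmosdb-sink-connector",
  "config": {
    "connector.class": "com.azure.cosmos.kafka.connect.sink.CosmosDBSinkConnector",
    "tasks.max": "1",
    "topics": "$5",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "connect.cosmos.connection.endpoint": "$1",
    "connect.cosmos.master.key": "$2",
    "connect.cosmos.databasename": "$3",
    "connect.cosmos.containers.topicmap": "$4"
  }
}
EOF
}

# Intended use (assumes the Connect REST API on localhost:8083):
#   cosmos_sink_config "<endpoint>" "<key>" "<database>" \
#     "mytopic1#myctr1" "mytopic1" | curl --silent -X POST \
#     -H "Content-Type: application/json" --data @- \
#     http://localhost:8083/connectors
```

Because the account key ends up in the connector definition, it is worth keeping the generated JSON out of version control.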
Once the connector is running, we can see the corresponding topics in the broker through Confluent Control Center at http://localhost:9021/clusters > Cluster > Topics. Now you can log in to your Azure portal account and query the data with Data Explorer.
Watch the new data being ingested into the target in real time by selecting Refresh. Check out the Microsoft docs for more information on Data Explorer, Azure Cosmos DB, and Confluent Cloud on Azure.
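If no documents show up in Data Explorer, the connector's state is a reasonable first thing to check. The sketch below reads the response of the Kafka Connect REST endpoint GET /connectors/<name>/status and prints the state of the connector and of each task. The helper name connector_states and the connector name in the usage comment are hypothetical.

```shell
# connector_states: read a connector status JSON document on stdin and
# print every "state" value, connector first and then each task, one per
# line. Uses plain text matching, so it is a sketch rather than a parser.
connector_states() {
  grep -o '"state" *: *"[A-Z_]*"' | sed 's/.*"\([A-Z_]*\)"$/\1/'
}

# Intended use (assumes the Connect REST API on localhost:8083):
#   curl --silent \
#     http://localhost:8083/connectors/cosmosdb-sink-connector/status \
#     | connector_states
```

Anything other than RUNNING on every line suggests looking at the trace field of the same status response or at the Connect container logs.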