Proactive, synthetic monitoring is critical for complex distributed systems like Apache Kafka®, especially when they run on Kubernetes and the end-user experience is at stake. Healthy real-time data pipelines depend on it. A key benefit for operations teams running Kafka on Kubernetes is infrastructure abstraction: a deployment can be configured once and run everywhere.
When integrated with Confluent Platform, Datadog can help visualize the performance of the Kafka cluster in real time and also correlate the performance of Kafka with the rest of your applications.
While Confluent recommends that customers use Confluent Cloud, where clusters are monitored for you, there are situations where you may need to self-host a Confluent Platform cluster on a cloud provider or on premises.
This blog post shows how to get more comprehensive visibility into a Confluent Platform deployment that uses Confluent for Kubernetes (CFK) on Amazon Elastic Kubernetes Service (AWS EKS), by collecting all Kafka telemetry data in one place and tracking it over time with Datadog.
Confluent for Kubernetes (CFK) is a cloud-native control plane for deploying and managing Confluent in your private cloud environment. It provides a standard and simple interface to customize, deploy, and manage Confluent Platform through a declarative API.
Datadog is a monitoring and analytics tool for IT and DevOps teams that can be used to track performance metrics and to monitor events for infrastructure and cloud services. It can monitor servers, databases, cloud infrastructure, system processes, serverless functions, and more. To get started on monitoring Kafka clusters using Datadog, you may refer to this documentation from Datadog.
Kubernetes, or K8s, is an open source platform that automates Linux container operations, eliminating many of the manual procedures involved in deploying and scaling containerized applications. Amazon Elastic Kubernetes Service (EKS) is a managed service that lets you deploy, manage, and scale containerized applications on Kubernetes. Datadog helps you monitor your EKS environments in real time. Because Datadog already integrates with Kubernetes and AWS, it is ready-made to monitor EKS.
The Confluent for Kubernetes (CFK) bundle contains Helm charts, templates, and scripts for deploying Confluent Platform to your Kubernetes cluster. You can deploy CFK using one of the following methods:
This blog post assumes you have Confluent Platform deployed on an AWS EKS cluster and running as described here. In order to get started with the AWS EKS cluster deployment, follow the steps in the documentation. Once you have the K8s cluster at your disposal, you can get started on installing CFK and Confluent Platform on the AWS EKS cluster nodes.
API keys are unique to your organization. An API key is required by the Datadog agent to submit metrics and events to Datadog. Once you are logged into the Datadog console, navigate to the Organization Settings in your Datadog UI and scroll to the API keys section. Create a new key and save it; you will need it later when installing the Datadog agent on the Kubernetes nodes that run Confluent Platform. For the next steps, refer to this documentation: Create API key.
First, Datadog agents need to be installed on every node of the K8s cluster to collect metrics, logs, and traces from your Kafka deployment. Before that, ensure that Kafka and ZooKeeper are exposing JMX data; then install and configure the Datadog agent on each of the producers, consumers, and brokers. The agent collects events and metrics from hosts and sends them to Datadog, where you can analyze your monitoring and performance data. It can run on local hosts (Windows, macOS), in containerized environments (Docker, Kubernetes), and in on-premises data centers. You can install and configure it using configuration management tools such as Chef, Puppet, or Ansible.
Datadog’s site name has to be set if you’re not using the default, datadoghq.com. You can pass it in the values.yaml file or, preferably, on the Helm command line with --set datadog.site=<DATADOG_SITE>.
Note: If the datadog.site variable is not explicitly set, it defaults to the US site datadoghq.com. If you are using one of the other sites (EU, US3, or US1-FED), the default will result in an invalid API key message. Use the site selector in Datadog’s documentation to find the correct name for the site you’re using.
To install the chart for Datadog, identify the right release name:
Using the Datadog values.yaml configuration file as a reference, create a values.yaml parameterized for your enterprise. Datadog recommends that your values.yaml only contain values that need to be overridden, as it allows a smooth experience when upgrading chart versions. If this is a fresh install, add the Helm Datadog repo:
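As a rough sketch of these two steps, the shell helpers below write a small override-only values.yaml and register the Datadog chart repository. The function names are hypothetical, and the two overrides shown (enabling log collection, and the jmx image tag suffix that selects the JMX-enabled agent image) are illustrative assumptions rather than settings taken from this post; keep only the overrides your environment needs.

```shell
# Write an override-only values file (assumed overrides; adjust or remove).
write_minimal_values() {
  values_file="${1:-values.yaml}"
  cat > "$values_file" <<'EOF'
datadog:
  logs:
    enabled: true        # assumption: collect container logs as well as metrics
agents:
  image:
    tagSuffix: jmx       # assumption: use the JMX-enabled agent image for JMX checks
EOF
}

# Register Datadog's Helm chart repository and refresh the local chart index.
add_datadog_repo() {
  helm repo add datadog https://helm.datadoghq.com
  helm repo update
}
```

Running write_minimal_values followed by add_datadog_repo leaves a values.yaml in the current directory and makes the datadog/datadog chart available to Helm.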
Retrieve your Datadog API key from your agent installation instructions and run:
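One way to express the install step is the helper below, a sketch rather than the exact command from the original post. It assumes the API key is exported as DD_API_KEY (a variable name chosen here for illustration) and uses a hypothetical default release name of datadog-agent; datadog.site and datadog.apiKey are the chart values discussed above.

```shell
# Install the Datadog chart.
# Arguments: release name, Datadog site, values file (all optional).
# DD_API_KEY (assumed variable name) must be set so the key stays out of the command text.
install_datadog_agent() {
  release_name="${1:-datadog-agent}"   # hypothetical default release name
  dd_site="${2:-datadoghq.com}"        # default US site
  values_file="${3:-values.yaml}"
  if [ -z "${DD_API_KEY:-}" ]; then
    echo "DD_API_KEY is not set" >&2
    return 1
  fi
  helm install "$release_name" \
    -f "$values_file" \
    --set datadog.site="$dd_site" \
    --set datadog.apiKey="$DD_API_KEY" \
    datadog/datadog
}
```

For example, install_datadog_agent my-release datadoghq.eu would target the EU site; with no arguments the defaults above apply.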
Modify your Confluent Platform’s yaml file to include the Datadog annotations. Add the following annotations to each component-specific CR (used for Datadog events). For autodiscovery to work, the part of each annotation key after the "/" must be the name of the CR (kafka, for the Kafka CR). The annotations apply to Kafka, ZooKeeper, Connect, and Schema Registry. Replace <cp-component> with the respective name.
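The annotation block itself is not reproduced here, so the helper below prints a plausible fragment for one component, to be merged into that component's CR. It is a sketch under assumptions: the annotation keys follow Datadog's standard autodiscovery format for the confluent_platform check, the podTemplate placement and the JMX port of 7203 are assumed defaults that you should verify against your own CRs, and the function name is hypothetical.

```shell
# Print Datadog autodiscovery annotations for one Confluent Platform component.
# Arguments: CR name (e.g. kafka), JMX port (assumed default: 7203).
datadog_annotations() {
  component="$1"
  jmx_port="${2:-7203}"   # assumption: verify the JMX port your CRs expose
  if [ -z "$component" ]; then
    echo "usage: datadog_annotations <cp-component> [jmx-port]" >&2
    return 1
  fi
  cat <<EOF
spec:
  podTemplate:
    annotations:
      ad.datadoghq.com/${component}.check_names: '["confluent_platform"]'
      ad.datadoghq.com/${component}.init_configs: '[{"is_jmx": true, "collect_default_metrics": true}]'
      ad.datadoghq.com/${component}.instances: '[{"host": "%%host%%", "port": "${jmx_port}"}]'
EOF
}
```

Running datadog_annotations kafka prints the fragment for the Kafka CR; repeat with zookeeper, connect, and schemaregistry (or whatever your CRs are named) for the other components.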
Refer to the complete Confluent Platform yaml in this GitHub repo. After all the annotations are configured correctly in each component Custom Resource, you will now redeploy Confluent Platform on K8s using the following command:
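The redeploy step amounts to re-applying the annotated manifest with kubectl. In the sketch below, the manifest file name and the namespace are hypothetical defaults, not values from the original post; substitute the file and namespace you actually use.

```shell
# Re-apply the annotated Confluent Platform manifest.
# Arguments: manifest path, namespace (both hypothetical defaults).
redeploy_confluent_platform() {
  manifest="${1:-confluent-platform.yaml}"   # hypothetical file name
  namespace="${2:-confluent}"                # hypothetical namespace
  kubectl apply -f "$manifest" --namespace "$namespace"
}
```

Because kubectl apply is declarative, only the resources whose specifications changed (here, the added annotations) are updated.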
Now it's time to integrate the Confluent Platform with Datadog. First, you need to install the integration with the Datadog Confluent Platform integration tile as shown in Figure 3. Navigate to the “Integrations” section on the left-hand side vertical menu.
Click the Install button on the Confluent Platform tile and you will now be presented with a widget that lets you configure the Datadog agents on your Kubernetes nodes where Confluent Platform’s Kafka clusters are located. Figures 4 and 5 demonstrate the overview of Confluent Platform-specific components from which Datadog collects JMX metrics and respective configurations.
Once the Datadog agents are installed on each of the K8s nodes, they should be listed when you run the following command:
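A simple way to list the agent pods is sketched below. It assumes the agent pod names contain the string datadog, which is typical when the pods are created by the Datadog chart, and it filters by name rather than by a label selector, since the exact labels depend on the chart version. The function name and the default namespace are hypothetical.

```shell
# List Datadog agent pods and the nodes they run on.
# Argument: namespace the agent was installed into (assumed default: default).
list_datadog_agent_pods() {
  namespace="${1:-default}"
  # -o wide adds the NODE column; grep drops the header and non-agent pods.
  kubectl get pods --namespace "$namespace" -o wide | grep -i datadog
}
```

You should see one agent pod per node in the Running state.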
Exec into one of the Datadog agent pods and check the Datadog agent status:
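As a sketch of this step, the helper below runs the agent's status subcommand inside a given pod. The function name and the default namespace are hypothetical; pass the name of one of your own agent pods.

```shell
# Run `agent status` inside a Datadog agent pod.
# Arguments: pod name (required), namespace (assumed default: default).
datadog_agent_status() {
  pod="${1:-}"
  namespace="${2:-default}"
  if [ -z "$pod" ]; then
    echo "usage: datadog_agent_status <agent-pod> [namespace]" >&2
    return 1
  fi
  kubectl exec "$pod" --namespace "$namespace" -- agent status
}
```

The output is long, so piping it through a pager or a text filter makes the relevant section easier to find.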
Look for the jmxfetch section of the agent status output. It should now show the established Confluent Platform integration.
You will now have a comprehensive dashboard that shows Confluent Platform metrics covering producers, consumers, brokers, Connect, ISRs, under-replicated partitions, ksqlDB, and so on. Depending on your business needs, you can now explore, slice, and dice the individual widgets.
Monitoring your Confluent Platform clusters running on Kubernetes in AWS enables proactive response, supports data security and data gathering, and contributes to an overall healthy data pipeline. Datadog is one of the most widely used SaaS solutions for network monitoring, infrastructure management, and application monitoring among Confluent customers. This post walked through the integration of Confluent Platform with Datadog on a K8s platform like EKS, to monitor key metrics, logs, and traces from your Kafka environment. This gives you improved visibility into Kafka health and performance, and lets you create automated alerts tailored to your infrastructure needs.