Observability of Confluent Platform on AWS EKS with Datadog

Proactive, synthetic monitoring is critical for complex, distributed systems like Apache Kafka®, especially when they are deployed on Kubernetes and the end-user experience is at stake. It is essential for healthy real-time data pipelines. A key benefit for operations teams running Kafka on Kubernetes is infrastructure abstraction: it can be configured once and run everywhere.

When integrated with Confluent Platform, Datadog can help visualize the performance of the Kafka cluster in real time and also correlate the performance of Kafka with the rest of your applications.

While Confluent recommends that customers use Confluent Cloud, which monitors clusters for you, there are situations where you may need to self-host a Confluent Platform cluster on a cloud provider or on premises.

This blog post shows how to get more comprehensive visibility into a Confluent Platform deployment that uses Confluent for Kubernetes (CFK) on Amazon Elastic Kubernetes Service (Amazon EKS), by collecting all Kafka telemetry data in one place and tracking it over time using Datadog.

Preliminary 

Confluent for Kubernetes (CFK) is a cloud-native control plane for deploying and managing Confluent in your private cloud environment. It provides a standard and simple interface to customize, deploy, and manage Confluent Platform through a declarative API. 

Datadog is a monitoring and analytics tool for IT and DevOps teams that tracks performance metrics and events for infrastructure and cloud services. It can monitor servers, databases, cloud infrastructure, system processes, serverless functions, and more. To get started on monitoring Kafka clusters using Datadog, you may refer to this documentation from Datadog.

Kubernetes, or K8s, is an open source platform that automates Linux container operations, eliminating manual procedures involved in deploying and scaling containerized applications. Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that lets you deploy, manage, and scale containerized applications on Kubernetes. Because Datadog already integrates with Kubernetes and AWS, it is ready-made to monitor your EKS environments in real time.

Deploy Confluent Platform with CFK on AWS EKS

The Confluent for Kubernetes (CFK) bundle contains Helm charts, templates, and scripts for deploying Confluent Platform to your Kubernetes cluster. You can deploy CFK using one of the following methods:

This blog post assumes you have Confluent Platform deployed on an AWS EKS cluster and running as described here. In order to get started with the AWS EKS cluster deployment, follow the steps in the documentation. Once you have the K8s cluster at your disposal, you can get started on installing CFK and Confluent Platform on the AWS EKS cluster nodes. 

Creating Datadog API keys

API keys are unique to your organization. The Datadog agent requires an API key to submit metrics and events to Datadog. Once you are logged into the Datadog console, navigate to Organization Settings in the Datadog UI and scroll to the API keys section. Create a new key and save it; you will need it later when integrating Confluent Platform on the Kubernetes nodes. For the next steps, refer to this documentation: Create API key.

Figure 1: Navigate to the API keys section on Datadog console

Figure 2: Create new API keys on Datadog console

Install Datadog agents 

First, the Datadog agent needs to be installed on every node of the K8s cluster to collect metrics, logs, and traces from your Kafka deployment. Before you do that, ensure that Kafka and ZooKeeper are sending JMX data; then install and configure the Datadog agent on each of the producers, consumers, and brokers. The agent collects events and metrics from hosts and sends them to Datadog, where you can analyze your monitoring and performance data. It can run on local hosts (Windows, macOS), in containerized environments (Docker, Kubernetes), and in on-premises data centers. You can install and configure it using configuration management tools such as Chef, Puppet, or Ansible.

Configuring the Datadog site name

Datadog’s site name has to be set if you’re not using the default, datadoghq.com. You can pass it in the values.yaml file or, preferably, on the Helm command line by setting datadog.site (for example, with --set).

Note: If the datadog.site variable is not explicitly set, it defaults to the US site datadoghq.com. If you are using one of the other sites (EU, US3, or US1-FED) this will result in an invalid API key message. Use Datadog’s documentation site selector to see appropriate names for the site you’re using.
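Because a site mismatch shows up as an invalid API key, it can help to check the key against the intended site before installing anything. The following is a minimal sketch, not an official procedure: it assumes the API host is the site name prefixed with "api.", that the key and site are supplied in the DD_API_KEY and DD_SITE environment variables, and that curl is installed. The helper names dd_api_url and validate_dd_key are made up for this example.

```shell
# Sketch: build the key-validation URL for a Datadog site and query it.
# Assumptions: DD_API_KEY and DD_SITE are set by you; helper names are hypothetical.

dd_api_url() {
  # $1 = site name; falls back to the default US site when omitted
  printf 'https://api.%s/api/v1/validate\n' "${1:-datadoghq.com}"
}

validate_dd_key() {
  # $1 = site name, $2 = API key; prints only the HTTP status code
  curl -s -o /dev/null -w '%{http_code}\n' \
    -H "DD-API-KEY: $2" "$(dd_api_url "$1")"
}

# Only call the API when a key has actually been provided.
if [ -n "${DD_API_KEY:-}" ]; then
  validate_dd_key "${DD_SITE:-datadoghq.com}" "$DD_API_KEY"
fi
```

A 200 status suggests the key is accepted on that site. A 403 usually means the key is wrong or belongs to a different site.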

To install the chart for Datadog, identify the right release name:

  1. Install Helm.

  2. Using the Datadog values.yaml configuration file as a reference, create a values.yaml parameterized for your enterprise. Datadog recommends that your values.yaml only contain values that need to be overridden, as it allows a smooth experience when upgrading chart versions. If this is a fresh install, add the Helm Datadog repo:

  3. Retrieve your Datadog API key from your agent installation instructions and run:
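Steps 2 and 3 can be sketched as follows. Treat this as an illustration under stated assumptions rather than the exact commands for your environment: the release name datadog-agent, the file name datadog-values.yaml, and the DD_API_KEY and DD_SITE environment variables are all placeholders chosen for this example. The values file enables log collection and selects the JMX-enabled agent image, which is an assumption based on the JMX checks used later in this post.

```shell
# Sketch of steps 2 and 3. Assumed names: release "datadog-agent",
# file "datadog-values.yaml", env vars DD_API_KEY and DD_SITE.
RELEASE_NAME="datadog-agent"
VALUES_FILE="datadog-values.yaml"

# Keep only the overrides in the values file, as Datadog recommends.
cat > "$VALUES_FILE" <<'EOF'
datadog:
  logs:
    enabled: true        # collect logs from annotated containers
agents:
  image:
    tagSuffix: jmx       # assumption: JMX-enabled image for JMX checks
EOF

# Run Helm only when it is installed and an API key is available.
if command -v helm >/dev/null 2>&1 && [ -n "${DD_API_KEY:-}" ]; then
  helm repo add datadog https://helm.datadoghq.com
  helm repo update
  helm install "$RELEASE_NAME" datadog/datadog \
    -f "$VALUES_FILE" \
    --set datadog.site="${DD_SITE:-datadoghq.com}" \
    --set datadog.apiKey="$DD_API_KEY"
else
  echo "helm or DD_API_KEY missing; skipping install" >&2
fi
```

Passing the site and API key with --set keeps them out of the values file, so the file can be committed or shared without exposing the key.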

Integrating Confluent Platform with Datadog

Annotations with CP and CFK

Modify your Confluent Platform’s yaml file to add the Datadog annotations. Add the following annotations to each component-specific custom resource (CR); they are used for Datadog events. For Autodiscovery to work, the identifier that follows the "/" in each annotation key must be the name of the CR (for example, kafka). The annotations apply to Kafka, ZooKeeper, Connect, and Schema Registry. Replace <cp-component> with the respective name.

spec:
  podTemplate:
    annotations:
      ad.datadoghq.com/<cp-component>.check_names: '["confluent_platform"]'
      ad.datadoghq.com/<cp-component>.init_configs: '[{"is_jmx": true, "collect_default_metrics": true, "service_check_prefix": "confluent", "new_gc_metrics": true, "collect_default_jvm_metrics": true}]'
      ad.datadoghq.com/<cp-component>.instances: '[{"host":"%%host%%","port":"7203","max_returned_metrics":300}]'
      ad.datadoghq.com/<cp-component>.logs: '[{"source":"confluent_platform","service":"confluent_platform"}]'
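Since the same four annotations have to be repeated for every component, it is easy to miss one. A quick pre-flight check along these lines may help before redeploying. This is a sketch: has_dd_annotations is a hypothetical helper, and the component names in the loop (kafka, zookeeper, connect, schemaregistry) are assumptions that should be replaced with the names used in your own CRs.

```shell
# Sketch: confirm a manifest carries all four Datadog annotations for a component.
# has_dd_annotations is a hypothetical helper, not part of kubectl or CFK.
has_dd_annotations() {
  # $1 = manifest file, $2 = component name used in the annotation keys
  for suffix in check_names init_configs instances logs; do
    if ! grep -qF "ad.datadoghq.com/$2.$suffix:" "$1"; then
      echo "missing ad.datadoghq.com/$2.$suffix in $1" >&2
      return 1
    fi
  done
  return 0
}

# Assumed manifest location and component names; adjust to your deployment.
MANIFEST="${CONFLUENT_HOME:-.}/confluent-platform-datadog-cfk.yaml"
if [ -f "$MANIFEST" ]; then
  for component in kafka zookeeper connect schemaregistry; do
    has_dd_annotations "$MANIFEST" "$component" && echo "$component: ok"
  done
fi
```

The check only looks for the annotation keys. It does not verify that the JSON values are well formed.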

Refer to the complete Confluent Platform yaml in this GitHub repo. After all the annotations are configured correctly in each component custom resource, redeploy Confluent Platform on K8s using the following command:

  kubectl apply -f $CONFLUENT_HOME/confluent-platform-datadog-cfk.yaml

Install Confluent Platform plugin on Datadog

Now it's time to integrate Confluent Platform with Datadog. First, install the integration using the Confluent Platform integration tile in Datadog, as shown in Figure 3. Navigate to the “Integrations” section in the vertical menu on the left-hand side.

Figure 3: Datadog Console showing Integration tab with Confluent Platform integration

Click the Install button on the Confluent Platform tile and you will now be presented with a widget that lets you configure the Datadog agents on your Kubernetes nodes where Confluent Platform’s Kafka clusters are located. Figures 4 and 5 demonstrate the overview of Confluent Platform-specific components from which Datadog collects JMX metrics and respective configurations. 

Figure 4: Confluent Platform installation overview on Integrations tab

Figure 5: Confluent Platform installation widget with required configurations

Validate Datadog agent installation

When Datadog agents are installed on each of the K8s nodes, they should be displayed when you run the following command:

  kubectl get pods -l app.kubernetes.io/component=agent

The output should list one Datadog agent pod for each K8s node.

Open a shell in one of the Datadog agent pods and check the Datadog agent status:

  kubectl exec -it <datadog-agent-pod> -- bash
  agent status

Look for the JMXFetch section of the agent status output. It should now show the established Confluent Platform integration.

  ========
  JMXFetch
  ========
    Information
    ==================
      runtime_version : 11.0.16
      version : 0.46.0
    Initialized checks
    ==================
      confluent_platform
        instance_name : confluent_platform-10.92.6.5-7203
        message : <no value>
        metric_count : 115
        service_check_count : 0
        status : OK
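If you want to script this check rather than read the output by eye, something like the following could work. It is a sketch that assumes the status output keeps the layout shown above: an "Initialized checks" heading, followed by the confluent_platform entry and its status line. The helper name jmx_check_ok is made up for this example.

```shell
# Sketch: read `agent status` output on stdin and succeed only if the
# confluent_platform check is listed under "Initialized checks" with status OK.
# jmx_check_ok is a hypothetical helper name.
jmx_check_ok() {
  awk '
    /Initialized checks/                { in_checks = 1 }
    in_checks && /confluent_platform/   { found = 1 }
    # Judge by the first status line that follows the check name.
    found && /status[ ]*:/              { ok = ($0 ~ /status[ ]*:[ ]*OK/); exit }
    END                                 { exit (ok ? 0 : 1) }
  '
}

# Example usage (replace the placeholder with a real agent pod name):
#   kubectl exec DATADOG_AGENT_POD -- agent status | jmx_check_ok && echo "JMX check OK"
```

The function exits non-zero when the check is missing or reports anything other than OK, so it can gate a deployment script or a CI step.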

Verify Confluent Platform dashboard

You now have a comprehensive dashboard that shows Confluent Platform metrics for producers, consumers, brokers, Connect, ksqlDB, in-sync replicas (ISRs), under-replicated partitions, and so on. You can explore, slice, and dice the individual widgets according to your business needs.

Figure 6: Confluent Platform Datadog dashboard

Conclusion

Monitoring your Confluent Platform clusters running on Kubernetes on AWS allows for proactive response, supports data security and data gathering, and contributes to an overall healthy data pipeline. Datadog is a widely used SaaS solution for network monitoring, infrastructure management, and application monitoring, and many Confluent customers rely on it. This post walked through the integration of Confluent Platform with Datadog on a K8s platform like EKS, to monitor key metrics, logs, and traces from your Kafka environment. This gives you improved visibility into Kafka health and performance and lets you create automated alerts tailored to your infrastructure needs.

  • Geetha Anne is a solutions engineer at Confluent with previous experience in executing solutions for data-driven business problems on cloud, involving data warehousing and real-time streaming analytics. She fell in love with distributed computing during her undergraduate days and has followed that interest ever since. Geetha provides technical guidance, design advice, and thought leadership to key Confluent customers and partners. She also enjoys teaching complex technical concepts to both tech-savvy and general audiences.

  • Moshe Blumberg is a senior storage and systems engineer at Confluent with high-level experience in technical support, project management implementation, and technical marketing.