
Monitor Kafka Streams Health Metrics in Confluent Cloud

Written By Daniel Joseph

It’s 3 a.m., and an alert fires: Your critical Kafka Streams application is lagging. The frantic troubleshooting begins. Is it a consumer group rebalance? You start searching through application logs across multiple pods. Is it a problem with the Apache Kafka® cluster itself? You switch to your cluster monitoring dashboards to check broker health. Or is there a silent bottleneck hidden deep in your application code? Without the right instrumentation, you're flying blind. After 30 minutes of digging through disparate tools, you're still just guessing.

We've heard this story from countless developers and operators. That’s why we’re excited to announce the launch of key Kafka Streams application health metrics directly in the Confluent Cloud Console. Now, you can quickly understand application state, identify potential bottlenecks, and monitor state store health without writing custom instrumentation code. This means less time firefighting, faster mean time to resolution (MTTR) for production incidents, and more developer cycles spent building features that drive your business. The best part? All you need to do is upgrade your Kafka Streams client library to any version above 4.0.

Instantly See Your Application's Health and Pinpoint Problem Instances

The new Kafka Streams page provides an at-a-glance health check of your entire Kafka Streams application. You can now see the overall application state, such as RUNNING, REBALANCING, PENDING_SHUTDOWN, or PENDING_ERROR, and drill down to view the state of each individual processing thread.

A common operational challenge when running multiple instances of a Kafka Streams application is that threads can have identical names, making it difficult to isolate which instance is having trouble. To solve this, the user interface (UI) now displays a unique process_id next to each thread. This ID corresponds to a specific application instance and is also printed in that instance's logs on startup.

This allows you to:

  • Disambiguate threads across all running application instances.

  • Correlate UI insights with your own observability tools. The process_id acts as a bridge, allowing you to take an insight from Confluent Cloud and immediately find the corresponding logs for that specific instance in your external platform.
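If you want the same signals in your own logs, the client APIs that back this view are already available. Below is a minimal Java sketch, assuming placeholder topic names, application ID, and bootstrap servers, that registers a state listener to log every transition (RUNNING, REBALANCING, PENDING_SHUTDOWN, and so on) and prints the names of the local processing threads so they can be matched against what the Console shows.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsStateLogger {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic"); // hypothetical pass-through topology

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            KafkaStreams streams = new KafkaStreams(builder.build(), props);

            // Log every state transition (e.g., REBALANCING -> RUNNING) so application
            // logs line up with the states shown in the Cloud Console.
            streams.setStateListener((newState, oldState) ->
                    System.out.printf("Streams state changed: %s -> %s%n", oldState, newState));

            streams.start();

            // Print the processing threads running in this instance; these are the
            // thread names the Console displays alongside the process_id.
            streams.metadataForLocalThreads()
                   .forEach(t -> System.out.println("Local stream thread: " + t.threadName()));
        }
    }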

Find Your Bottleneck: Is It Your Code or the Cluster?

At the heart of this feature are four essential metrics that reveal how your application allocates its time: the poll, process, commit, and punctuate ratios.

  • Poll ratio: The fraction of time the application is idle and waiting for new records. A high poll ratio is a sign of a healthy application.

  • Process and punctuate ratios: The fraction of time spent executing your business logic. High ratios often suggest a bottleneck in your application code or underlying hardware. As a general guideline, a process ratio > 0.8 or punctuate ratio > 0.3 warrants investigation.

  • Commit ratio: The fraction of time spent committing offsets. A consistently high ratio can point to network issues or a health problem with the Kafka cluster itself. You should investigate if you see a commit ratio > 0.2.

We also surface end-to-end record latency (average, minimum, and maximum) to provide a clear measure of processing timeliness.
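If you would like to pull these same ratios and latency figures into your own logging or dashboards, they are exposed through the embedded client's metrics registry. The sketch below assumes the standard Kafka Streams metric names (poll-ratio, process-ratio, commit-ratio, punctuate-ratio, and the record-e2e-latency-* task metrics) and simply prints anything that matches; adapt the filter to your needs.

    import org.apache.kafka.streams.KafkaStreams;

    public class StreamsRatioPrinter {

        // Prints the thread-level time ratios and task-level end-to-end latency
        // metrics of a running KafkaStreams instance. Assumes the standard metric
        // names; anything matching "-ratio" or "record-e2e-latency" is printed.
        public static void printHealthMetrics(KafkaStreams streams) {
            streams.metrics().forEach((metricName, metric) -> {
                String name = metricName.name();
                if (name.endsWith("-ratio") || name.startsWith("record-e2e-latency")) {
                    System.out.printf("%s (%s) = %s%n",
                            name, metricName.group(), metric.metricValue());
                }
            });
        }
    }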

Putting It All Together: A Troubleshooting Scenario

These metrics aren't just data points; together they form a diagnostic tool that transforms your troubleshooting workflow. Let's replay that 3 a.m. alert with the new Kafka Streams UI:

  1. Check the application state. You first glance at the UI and see the overall state is RUNNING, not REBALANCING. This immediately rules out a consumer group issue.

  2. Examine the ratios. You look at the performance ratios. The commit ratio is low and healthy, suggesting that the cluster and network are fine. However, you see the process ratio is consistently high, spiking above 0.8.

  3. Get your answer. The bottleneck is in your application code.

In less than a minute, you've ruled out external factors and you know exactly where to focus your investigation, saving valuable time and effort.
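As a rough illustration of that workflow in code, the sketch below applies the rule-of-thumb thresholds from this post to the live metrics of a running instance. The cutoffs (0.8, 0.3, 0.2) are the heuristics described above, not hard limits, and the thread-id tag lookup assumes the standard thread-level metric tags.

    import org.apache.kafka.streams.KafkaStreams;

    public class BottleneckHeuristic {

        // Applies the rule-of-thumb thresholds from this post to live metrics:
        // process-ratio > 0.8, punctuate-ratio > 0.3, commit-ratio > 0.2.
        // These are heuristics, not hard limits; tune them to your workload.
        public static void flagSuspects(KafkaStreams streams) {
            if (streams.state() != KafkaStreams.State.RUNNING) {
                System.out.println("Application state is " + streams.state() + "; look at rebalances first.");
                return;
            }
            streams.metrics().forEach((metricName, metric) -> {
                Object value = metric.metricValue();
                if (!(value instanceof Double)) {
                    return; // skip non-numeric metrics
                }
                double ratio = (Double) value;
                String thread = metricName.tags().getOrDefault("thread-id", "unknown-thread");
                switch (metricName.name()) {
                    case "process-ratio":
                        if (ratio > 0.8) {
                            System.out.printf("%s: process-ratio %.2f -> likely bottleneck in application code%n", thread, ratio);
                        }
                        break;
                    case "punctuate-ratio":
                        if (ratio > 0.3) {
                            System.out.printf("%s: punctuate-ratio %.2f -> heavy punctuation work%n", thread, ratio);
                        }
                        break;
                    case "commit-ratio":
                        if (ratio > 0.2) {
                            System.out.printf("%s: commit-ratio %.2f -> check network and cluster health%n", thread, ratio);
                        }
                        break;
                    default:
                        break;
                }
            });
        }
    }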

Monitor State Store Health With RocksDB Metrics

For stateful applications, performance is tightly linked to memory management across the Java heap and RocksDB's native memory. The following metrics provide insight into both:

  • estimate-num-keys: This helps you track the overall size of your state. A rapidly increasing key count can indicate unbounded state growth. 

  • size-all-memtables: This metric reflects the off-heap memory used by RocksDB's in-memory write buffers, called memtables. When your application writes data, it first goes to an active memtable. By default, Kafka Streams configures RocksDB with up to three 16MB memtables per state store. When one memtable fills up, it becomes immutable and is scheduled to be flushed to a file on disk (an SST file) while a new one accepts incoming writes.

    • How to interpret this metric: A consistently high value (e.g., approaching the maximum of 48MB per store) indicates a heavy write workload. If the value remains high without decreasing, it may signal that disk input/output is not fast enough to keep up with the flush rate, pointing to a potential storage bottleneck.

  • block-cache-usage: This shows the off-heap memory used by the RocksDB block cache, which holds frequently accessed data blocks from disk to accelerate reads. By default, Kafka Streams configures this cache with a size of 50MB per store. For read-heavy applications, you may need to increase this size to ensure that your working data set fits in the cache, which can dramatically improve performance. The size can be configured via a RocksDBConfigSetter implementation.
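As one illustration, here is a minimal RocksDBConfigSetter sketch in the spirit of the memory management guide. The 128MB block cache and 32MB write buffer values are purely illustrative; size them against the metrics above and the off-heap memory you actually have. Register the class via the rocksdb.config.setter property (StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG).

    import java.util.Map;
    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Cache;
    import org.rocksdb.LRUCache;
    import org.rocksdb.Options;

    public class TunedRocksDBConfig implements RocksDBConfigSetter {

        // Illustrative sizes only, not recommendations.
        private static final long BLOCK_CACHE_BYTES  = 128 * 1024 * 1024L; // up from the 50MB default
        private static final long WRITE_BUFFER_BYTES = 32 * 1024 * 1024L;  // memtable size, up from 16MB

        private Cache cache;

        @Override
        public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
            // Swap in a larger block cache for read-heavy stores.
            cache = new LRUCache(BLOCK_CACHE_BYTES);
            BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
            tableConfig.setBlockCache(cache);
            options.setTableFormatConfig(tableConfig);
            // Larger memtables for write-heavy stores.
            options.setWriteBufferSize(WRITE_BUFFER_BYTES);
        }

        @Override
        public void close(final String storeName, final Options options) {
            // Release the native cache; Kafka Streams calls this when the store closes.
            cache.close();
        }
    }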

It's important to also consider the Kafka Streams record cache, which resides on the Java heap. This cache reduces the number of writes to RocksDB. Its size is controlled by the statestore.cache.max.bytes configuration, which defaults to 10MB per application instance. This memory is then divided evenly among the running threads of that instance. Increasing this cache can improve write performance.
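For completeness, here is a short configuration sketch that raises the on-heap record cache to an illustrative 64MB per instance and wires in the config setter from the previous example:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class CacheTuningConfig {
        public static Properties cacheTuningProps() {
            Properties props = new Properties();
            // On-heap record cache: raise the 10MB-per-instance default (illustrative
            // value); it is divided evenly among the instance's stream threads.
            props.put(StreamsConfig.STATESTORE_CACHE_MAX_BYTES_CONFIG, 64 * 1024 * 1024L);
            // Off-heap RocksDB tuning: plug in the config setter sketched above.
            props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, TunedRocksDBConfig.class);
            return props;
        }
    }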

Note: These RocksDB metrics are currently aggregated across all state stores and threads. They serve as a heuristic for the overall health of your application's state.

For a deeper dive, refer to the Kafka Streams Developer Guide for Memory Management in our documentation.

Powered by the Community

This new capability is built on KIP-1076 and KIP-1091, both of which Confluent is proud to have contributed to the open source community. This work makes core Kafka Streams metrics available natively, and we’re thrilled to surface them directly in the Confluent Cloud UI to simplify operations for every developer.

Get Started in Minutes

Gaining this new visibility is incredibly simple. There’s no complex setup or instrumentation code to write. To enable these metrics, all you need to do is ensure that your application is built with a Kafka Streams client version above 4.0.

Once you deploy your updated application, the metrics will automatically appear in the Kafka Streams page for your cluster in the Confluent Cloud UI. For teams who want to integrate this data into existing monitoring dashboards, all metrics will be available via the Confluent Cloud Metrics API and in our exportable formats for Prometheus and Datadog.

Providing this foundational visibility is the first step in our mission to help you master the operational challenges of Kafka Streams. We're already working on future improvements that will provide even deeper insights into your Kafka Streams workloads as well as tooling that gives you more operational control than ever. Stay tuned!

Disclaimer: The guidelines above are general rules of thumb. The interpretation of the metrics is highly dependent on your specific workload. Think of these guidelines as a starting point for understanding your application's behavior and narrowing down your investigation.


Confluent and associated marks are trademarks or registered trademarks of Confluent, Inc.

Apache® and Apache Kafka® and the respective logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by the Apache Software Foundation is implied by using these marks. All other trademarks are the property of their respective owners.

  • Daniel Joseph is a Senior Product Manager at Confluent working on Kafka Streams. Prior to Confluent, he held product roles focused on developer platforms at Google Cloud and Wayfair. He started his career building enterprise data pipelines at Cloudera. Daniel holds a Software Engineering degree from the University of Waterloo and is a Master's candidate in Computer Science at Georgia Tech. He is driven by the question of what tooling developers will need for an AI-native, streaming future.
