A joint post from the teams at NeuBird and Confluent
For organizations leveraging Confluent, ensuring smooth operations is mission-critical. While Confluent Cloud eliminates the operational burden of managing Apache Kafka®, application teams still need to monitor and troubleshoot client applications connecting to Kafka clusters.
Traditionally, when issues arise—whether it's unexpected consumer lag, authorization errors, or connectivity problems—engineers must manually piece together information from multiple observability tools, logs, and metrics to identify root causes. This process is time-consuming, requires specialized expertise, and often extends resolution times.
Today, we're excited to share how Hawkeye by NeuBird, a site reliability engineer (SRE) assistant powered by generative artificial intelligence (GenAI), is transforming this experience by automating the investigation and resolution of Confluent Cloud incidents—allowing your team to focus on innovation rather than firefighting.
Confluent's observability setup provides a strong foundation for monitoring Kafka clients connected to Confluent Cloud. It leverages:
A time-series database (Prometheus) for metrics collection
Client metrics from Java consumers and producers
Visualization through Grafana dashboards
Failure scenarios to learn from and troubleshoot
Confluent's observability demo is valuable for understanding how to monitor Kafka clients and diagnose common issues, but it still relies on human expertise to interpret the data and determine root causes.
NeuBird builds on Confluent's robust observability foundation by integrating Hawkeye directly into the Kafka monitoring ecosystem. This combination goes beyond monitoring to introduce intelligent, automated incident response, significantly reducing mean time to resolution (MTTR).
NeuBird augments Confluent's observability with three significant improvements:
Kubernetes deployment: Containerized the entire setup for deployment on Kubernetes (Amazon EKS), making it more representative of production environments and easier to reproduce
Alertmanager integration: Added Prometheus Alertmanager rules that trigger PagerDuty incidents, creating a complete alerting pipeline
Audit logging: Expanded the telemetry scope to include both metrics and logs in Amazon CloudWatch, giving a more comprehensive view of the environment
Most importantly, we've integrated Hawkeye to automatically investigate incidents as they occur and recommend specific fixes.
Let's walk through a real-world scenario from the Confluent demo: the "authorization revoked" case, where a producer's permission to write to a topic is unexpectedly revoked.
In the original demo workflow, here's what typically happens:
An engineer receives an alert about producer errors.
They log into Grafana to check producer metrics.
They notice the record error rate has increased.
They check Confluent Cloud metrics and see inbound traffic but no new retained bytes.
They examine producer logs and find TopicAuthorizationException errors.
They investigate access control lists (ACLs) and find the producer's permissions were revoked.
They restore the correct ACLs to resolve the issue.
This manual process might take 15 to 30 minutes for an experienced Kafka engineer, assuming they're immediately available when the alert triggers.
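To make the log-examination step concrete, here is a minimal Python sketch of what an engineer effectively does by hand: scan producer log lines for TopicAuthorizationException and count the affected topics. The log lines, timestamps, and topic name are illustrative assumptions, not output captured from the demo, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the exception name followed by a bracketed topic list.
# The exact log layout is an assumption made for this sketch.
AUTH_ERROR = re.compile(r"TopicAuthorizationException: .*?\[(?P<topics>[^\]]*)\]")


def count_authorization_errors(log_lines):
    """Return a Counter mapping topic name to number of authorization errors."""
    counts = Counter()
    for line in log_lines:
        match = AUTH_ERROR.search(line)
        if match is None:
            continue
        # A single exception may list several comma-separated topics.
        for topic in match.group("topics").split(","):
            topic = topic.strip()
            if topic:
                counts[topic] += 1
    return counts


# Illustrative sample lines (hypothetical timestamps and topic name).
SAMPLE_LOG = [
    "2024-05-01T10:00:01Z INFO  Producer started",
    "2024-05-01T10:02:13Z ERROR org.apache.kafka.common.errors."
    "TopicAuthorizationException: Not authorized to access topics: [demo-topic-1]",
    "2024-05-01T10:02:14Z ERROR org.apache.kafka.common.errors."
    "TopicAuthorizationException: Not authorized to access topics: [demo-topic-1]",
    "2024-05-01T10:02:15Z WARN  Retrying send",
]

if __name__ == "__main__":
    print(dict(count_authorization_errors(SAMPLE_LOG)))
```

A non-empty result points the engineer toward the ACL check rather than toward network or broker problems.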
With our enhanced setup including Hawkeye, this is how the workflow is transformed:
Prometheus Alertmanager detects increased error rates and triggers a PagerDuty incident.
Hawkeye automatically begins investigating the issue by:
Retrieving and analyzing producer metrics from Prometheus
Correlating with Confluent Cloud metrics
Examining producer logs for error patterns
Checking Amazon CloudWatch for audit logs showing ACL changes
Within minutes, Hawkeye identifies the TopicAuthorizationException and links it to recent ACL changes.
Hawkeye generates a detailed root cause analysis with specific remediation steps.
An engineer reviews Hawkeye's findings and applies the recommended fix (or optionally approves Hawkeye to implement the fix automatically).
The entire process is now reduced to minutes, even when the issue occurs outside business hours. More importantly, your specialized Kafka engineers can focus on more strategic work rather than routine troubleshooting.
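The correlation step in this workflow can be illustrated with a small Python sketch. It is not Hawkeye's implementation, which this post does not describe; it only shows the general idea of linking the first authorization error to ACL deletions that happened shortly before it. The audit event fields (`time`, `operation`, `principal`, `topic`) and the ten-minute window are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse a UTC timestamp such as 2024-05-01T10:02:13Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def find_suspect_acl_changes(first_error_at, audit_events, topic,
                             window=timedelta(minutes=10)):
    """Return ACL deletions on `topic` made within `window` before the first error.

    Results are ordered newest first, since the most recent change is the
    most likely cause.
    """
    error_time = parse_ts(first_error_at)
    suspects = []
    for event in audit_events:
        if event["operation"] != "DeleteAcls" or event["topic"] != topic:
            continue
        elapsed = error_time - parse_ts(event["time"])
        if timedelta(0) <= elapsed <= window:
            suspects.append(event)
    return sorted(suspects, key=lambda e: parse_ts(e["time"]), reverse=True)


# Hypothetical audit events; the field names are assumptions for this sketch.
AUDIT_EVENTS = [
    {"time": "2024-05-01T09:30:00Z", "operation": "CreateAcls",
     "principal": "User:sa-producer", "topic": "demo-topic-1"},
    {"time": "2024-05-01T10:01:50Z", "operation": "DeleteAcls",
     "principal": "User:sa-producer", "topic": "demo-topic-1"},
    {"time": "2024-05-01T10:01:55Z", "operation": "DeleteAcls",
     "principal": "User:sa-other", "topic": "other-topic"},
]

if __name__ == "__main__":
    for event in find_suspect_acl_changes(
            "2024-05-01T10:02:13Z", AUDIT_EVENTS, "demo-topic-1"):
        print(event["time"], event["operation"], event["principal"])
```

Restricting the search to a short window before the first error keeps unrelated, older ACL changes out of the result.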
In this video, we demonstrate the complete workflow:
Deploying the enhanced Confluent observability solution to Kubernetes
Triggering the authorization revocation scenario
Watching Hawkeye automatically detect, investigate, and diagnose the issue
Reviewing Hawkeye's detailed analysis and remediation recommendations
Implementing the fix and verifying the resolution
NeuBird’s enhanced solution builds on Confluent's observability foundation with several key components:
Kubernetes deployment: All components are packaged as containers and deployed to Amazon EKS using Helm charts, making the setup reproducible and scalable.
Prometheus and Alertmanager: Custom alerting rules are specifically designed for Confluent Cloud metrics and common failure patterns.
Amazon CloudWatch integration: Both metrics and logs are forwarded to CloudWatch, providing a centralized location for all telemetry data.
Hawkeye integration: Hawkeye connects securely to your telemetry sources with read-only permissions, leveraging GenAI to understand patterns, correlate events, and recommend precise solutions.
The architecture follows security best practices: Hawkeye never stores your telemetry data, operates with minimal permissions, and performs all analysis in ephemeral, isolated environments.
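As a rough illustration of the custom alerting rules listed above, the Python sketch below mirrors the "pending, then firing" behavior of a Prometheus rule with a `for` clause: the alert fires only when the producer's record error rate has stayed above a threshold for a sustained period. The actual rules are written in Prometheus's own rule format, not Python, and the threshold, duration, and sample values here are assumptions, not the values used in the repository.

```python
def alert_fires(samples, threshold, for_seconds):
    """Decide whether an alert should fire at the time of the last sample.

    samples: list of (unix_seconds, error_rate) tuples sorted by time.
    The alert fires only if every sample since some point at least
    `for_seconds` before the last sample is above `threshold`.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            # Remember when the current uninterrupted breach began.
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any healthy sample resets the pending period.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# Hypothetical scrape results: (seconds since start, record error rate).
SAMPLES = [
    (0, 0.0), (15, 0.0), (30, 2.5), (45, 3.0),
    (60, 2.8), (75, 3.1), (90, 2.9),
]

if __name__ == "__main__":
    print(alert_fires(SAMPLES, threshold=0.0, for_seconds=60))
```

Requiring a sustained breach rather than a single bad sample is one way to avoid paging on transient errors that the producer's own retries would absorb.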
Organizations using Hawkeye with Confluent Cloud have seen significant operational improvements:
Reduced MTTR: Issues that previously took hours to diagnose are now resolved in minutes.
Decreased alert fatigue: Engineers are engaged only when human intervention is truly needed.
Knowledge democratization: Teams less familiar with Kafka can confidently operate complex Confluent Cloud environments.
Improved service level agreements: With faster resolution times, application availability and performance metrics improve.
An enterprise IT storage company, for example, reduced its MTTR for DevOps pipeline failures by implementing Hawkeye. When one of its applications entered a crash loop that caused production downtime, Hawkeye automatically picked up the alert from PagerDuty, investigated the issue, and determined that the crashes were caused by a recent application deployment. Hawkeye then identified the specific application and process that needed to be rolled back, dramatically reducing resolution time.
Want to try this enhanced observability setup within your own Confluent Cloud environment? Here's how to get started:
Start with the original Confluent observability demo to understand the components.
Check out our GitHub repository for the Kubernetes-ready version with Prometheus Alertmanager rules.
Schedule a demo to see Hawkeye in action.
The combination of Confluent Cloud and NeuBird's Hawkeye is a powerful shift in how organizations operate Kafka environments. By leveraging Confluent's rich telemetry data and Hawkeye's GenAI-powered automation, teams can significantly reduce operational overhead, improve reliability, and focus on delivering value rather than troubleshooting infrastructure.
As data streaming becomes increasingly central to modern applications, this type of intelligent automation will be essential for scaling operations teams effectively—letting them support larger, more complex deployments without proportionally increasing headcount or sacrificing reliability.
Apache®, Apache Kafka®, and Kafka® are registered trademarks of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of these marks.