A global investment bank partnered with Confluent to achieve sub-5 ms p99 end-to-end latency for critical trading pipelines with strict durability requirements. Through a meticulous combination of architectural discipline, in-depth monitoring, and nuanced configuration of brokers, producers, and consumers, the teams not only met their initial goal of 100,000 messages/sec but ultimately sustained this ultra-low latency at an astounding 1.6 million messages/sec with messages smaller than 5 KB.
This achievement underscores the importance of strategic architectural choices, a monitoring-first philosophy, and the advanced diagnostic techniques required for success in high-stakes Apache Kafka® environments. The project established durable, high-throughput pipelines with robust ordering guarantees and predictable tail performance, providing pivotal lessons for anyone seeking order-preserving, high-volume streaming with Kafka at scale.
Want to learn more about Kafka fundamentals to understand end-to-end latency in Kafka deployments before diving into this case study? Check out the blog post, “99th Percentile Latency at Scale With Apache Kafka,” or download the white paper “Optimizing Your Apache Kafka Deployment.”
In the demanding world of global capital markets, a single millisecond can be the difference between opportunity and risk in real-time trading. For the largest financial institutions, platform engineering is driven not by average latency but by rare latency outliers. While most tuning projects focus on median (p50) or 95th percentile (p95) latency, a major investment bank's modernization initiative set a more ambitious standard: sub-5 ms p99 end-to-end latency in a multi-data center deployment. This demanded end-to-end delivery guarantees, strict order preservation, and full disaster recovery readiness at scale.
Achieving this stringent goal required a comprehensive approach that extended beyond producer-only optimization. The team meticulously instrumented every stage of the Kafka message path, effectively identifying and mitigating sources of "tail latency" (e.g., resource bottlenecks, inefficient consumer configurations, unexpected producer traffic peaks, data skew across partitions, JVM garbage collection pauses, etc.) that would typically remain undetected. This rigorous process ultimately redefined the capabilities of open source streaming for mission-critical workloads.
Unlike most Kafka tuning stories—often centered on producer performance in a single data center setup—this initiative focused on the end-to-end message latency in a multi-data center deployment. This setup called for deep collaboration between Confluent experts and the customer’s engineering and infrastructure teams. Using both JMX monitoring and infrastructure monitoring gave everyone involved visibility into where time was spent and where bottlenecks emerged. Together, we scrutinized every layer of the pipeline—producer, broker, and infrastructure—to diagnose latency bottlenecks and tune configurations for optimal results.
The initial test configuration was deliberately minimal: a single Kafka partition with four replicas. Leveraging the multi-region cluster’s topic placement constraints, the team explicitly placed replicas across two data centers—a replica and an observer in each data center. With a measured ping latency of approximately 0.3–0.4 ms between these sites, the team established an irreducible physical “floor” for the project; any duration beyond this baseline represented overhead to be optimized. This design minimized the number of required in-sync replicas, facilitating both low latency and quick failover during disaster recovery.
The deliberate focus on a single partition let the team maintain strict message order, which is essential for financial event integrity, while isolating and measuring latency limits without the complexities introduced by partition parallelism.
The broader cluster was provisioned with eight brokers, creating a robust environment for future scalability tests. However, by channeling benchmark traffic exclusively through a single topic partition with four replicas, only four brokers were actively involved in the critical path for low-latency validation. This minimal setup reflected a “lowest baseline first” methodology, enabling meticulous scrutiny of every intermediate source of delay. Once the lowest achievable latency was validated in this controlled environment, the architecture could be confidently expanded—adding partitions to meet business workload needs while retaining the insights around scaling effects.
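For readers who want to reproduce this kind of single-partition, two-data-center layout, a replica-plus-observer placement can be expressed with Confluent Platform’s confluent.placement.constraints topic configuration. The sketch below is illustrative only: the bootstrap address, topic name, and rack labels (dc1/dc2, assumed to match the brokers’ broker.rack settings) are hypothetical, and the placement JSON schema can vary by Confluent Platform version.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePlacedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-dc1:9092"); // hypothetical

        // One replica plus one observer in each data center (rack labels are hypothetical).
        String placement = "{"
                + "\"version\": 2,"
                + "\"replicas\": ["
                + "  {\"count\": 1, \"constraints\": {\"rack\": \"dc1\"}},"
                + "  {\"count\": 1, \"constraints\": {\"rack\": \"dc2\"}}],"
                + "\"observers\": ["
                + "  {\"count\": 1, \"constraints\": {\"rack\": \"dc1\"}},"
                + "  {\"count\": 1, \"constraints\": {\"rack\": \"dc2\"}}]"
                + "}";

        try (AdminClient admin = AdminClient.create(props)) {
            // Single partition to preserve strict ordering; replication factor is left
            // unset so the placement constraints drive replica assignment.
            NewTopic topic = new NewTopic("trades", Optional.of(1), Optional.empty())
                    .configs(Map.of("confluent.placement.constraints", placement));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```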
The hardest part was optimizing for p99 latency. By definition, the 99th percentile captures the rarest and largest outliers—moments when broker, disk, or network hiccups create unpredictable latency. While p95 tuning eliminates most common delays, it does not protect mission-critical financial workflows from those tail events. Thus, configuration changes, hardware upgrades, and monitoring protocols were evaluated not just for average improvements but also for their ability to flatten the latency curve deep into the distribution.
A pivotal tool throughout was the OpenMessaging Benchmark (OMB), an open source test harness built for scalable, reproducible workload generation and latency/throughput observability. OMB orchestrated real-world traffic patterns, captured high-fidelity latency histograms, and produced actionable, reproducible insights on every aspect of Kafka performance. OMB's modular framework streamlined testing across configuration and infrastructure changes and upgrades.
For readers interested in replicating or extending these performance tests, OMB is openly documented and supported by a large community. Reference documentation and code can be found at the OpenMessaging Benchmark GitHub repository, and detailed installation and usage guides are available on the repo’s Wiki and README, making it a robust choice for comparative and long-term performance engineering.
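OMB reports its latency distributions from HDR histograms so that high percentiles stay accurate even at millions of samples. For teams instrumenting their own harnesses, the sketch below shows the same recording-and-reporting pattern using the HdrHistogram library; the latency values here are synthetic placeholders, standing in for the (consume timestamp minus produce timestamp) measurements a real harness would capture.

```java
import java.util.concurrent.ThreadLocalRandom;
import org.HdrHistogram.Histogram;

public class LatencyReport {
    public static void main(String[] args) {
        // Track end-to-end latencies up to 10 seconds (recorded in microseconds)
        // with 3 significant digits of precision.
        Histogram e2e = new Histogram(10_000_000L, 3);

        // In a real harness, record (consumeTimeMicros - produceTimeMicros) per message.
        // Synthetic values are used here purely to make the sketch runnable.
        for (int i = 0; i < 1_000_000; i++) {
            e2e.recordValue(ThreadLocalRandom.current().nextLong(500, 5_000));
        }

        for (double p : new double[] {50.0, 75.0, 95.0, 99.0, 99.9, 99.99}) {
            System.out.printf("p%-6s %.3f ms%n", p, e2e.getValueAtPercentile(p) / 1000.0);
        }
    }
}
```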
The architecture was improved not through a single overhaul but through incremental enhancements, allowing for continuous optimization and adaptation and building on lessons learned to create a robust, efficient, ultra-low latency trading pipeline. This methodical approach preserved system stability and performance, as each change was carefully evaluated before being integrated.
Early OMB benchmarks uncovered classic distributed system headaches under load: offset commit failures, sporadic record expirations, group rebalance anomalies, and latency outliers. The teams followed an iterative loop—measure, diagnose, optimize—with each cycle underpinned by cross-layer metrics.
Below, we classify and explain the most important metrics that proved indispensable during this engagement.
Summary of Kafka Observability Metrics Used to Tune for Ultra-Low Latency
| Type of Metrics | Insight Gained | Metrics Used | Adjustments Made |
| --- | --- | --- | --- |
| Broker metrics | Pinpointing cluster bottlenecks | MessagesInPerSec / BytesInPerSec / BytesOutPerSec; RequestsPerSec; RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs; UnderReplicatedPartitions and ISR health | Broker configuration tuning: num.network.threads = 48; num.io.threads = 96; num.replica.fetchers = 8; log.segment.bytes tuned upward from 1 GB; heap size 64 GB |
| Producer metrics | Exposing send-side latency and batching | request-latency-avg / request-latency-max; batch-size-avg and bufferpool-wait-time; record-send-rate and records-per-request-avg; record-error-rate and record-retry-rate | Producer configuration tuning: linger.ms = 1; batch.size = 32,000; compression.type = lz4; max.in.flight.requests.per.connection = 1; enable.idempotence = true; acks = all |
| Consumer metrics | Ensuring true end-to-end delivery | records-consumed-rate / bytes-consumed-rate; fetch-latency-avg / fetch-latency-max; records-lag-max; commit-latency-avg / rebalance-latency-max | Consumer configuration tuning: max.poll.records = 10,000; max.poll.interval.ms = 200 |
| Host metrics | Validating infrastructure health | CPU utilization and load average; memory usage and swappiness; disk I/O and latency; network throughput and errors | Infrastructure upgrades: Confluent Platform 7.8; upgraded bare metal; XFS file system; vm.swappiness = 1; generational ZGC (Z Garbage Collector); JDK 21 |
MessagesInPerSec / BytesInPerSec / BytesOutPerSec: These throughput metrics revealed real-time load on each broker, helping us correlate spikes in traffic with latency anomalies and resource saturation.
RequestsPerSec: Provided insight into the rate of handled requests, segmented by type (Produce, FetchConsumer, FetchFollower), which is closely linked to CPU and network utilization.
RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs: This suite of request timing metrics allowed us to break down the lifecycle of each Kafka request. Notably, RemoteTimeMs was critical for identifying delays caused by slow in-sync replicas (ISRs) when using acks=all, while RequestQueueTimeMs and ResponseQueueTimeMs flagged thread or CPU contention.
UnderReplicatedPartitions and ISR Health: Monitoring the number of under-replicated partitions and the health of the ISR set was essential for both durability and latency. Any lagging broker in the ISR could inflate p99 latency, making these metrics vital for proactive remediation.
RequestQueueSize / ResponseQueueSize: Indicated the depth of queued requests and responses, helping to spot resource exhaustion or thread bottlenecks.
NetworkProcessorAvgIdlePercent / RequestHandlerAvgIdlePercent: These idleness metrics helped identify when brokers were running at or near capacity, which could lead to increased queuing and latency.
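All of these request-phase timings are published as broker JMX metrics under kafka.network:type=RequestMetrics, so they can be sampled even without a full monitoring stack. The sketch below is a rough illustration of that pattern; the JMX host and port are hypothetical and assume remote JMX is enabled on the broker.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestTimingProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker JMX endpoint; enable it on the broker (e.g., via JMX_PORT).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-dc1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            String[] phases = {"RequestQueueTimeMs", "LocalTimeMs", "RemoteTimeMs",
                               "ResponseQueueTimeMs", "ResponseSendTimeMs"};
            for (String phase : phases) {
                ObjectName name = new ObjectName(
                        "kafka.network:type=RequestMetrics,name=" + phase + ",request=Produce");
                // Kafka's histogram-backed request metrics expose percentile attributes.
                Object p99 = mbs.getAttribute(name, "99thPercentile");
                System.out.printf("Produce %s p99 = %s ms%n", phase, p99);
            }
        }
    }
}
```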
request-latency-avg / request-latency-max: These metrics quantified both average and worst-case producer request times, directly surfacing the impact of configuration changes and cluster health on end-to-end latency.
batch-size-avg and bufferpool-wait-time: By tracking batch sizes and buffer pool utilization, we could fine-tune batching parameters for optimal throughput without introducing queuing delays.
record-send-rate and records-per-request-avg: Provided a pulse on production rates and batching effectiveness.
record-error-rate and record-retry-rate: Flagged any delivery issues or retries that could cascade into higher tail latency.
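Each of these producer-side metrics is also available directly from the Java client's metrics() map, which made it straightforward to log them alongside application events. The sketch below is a minimal illustration: the bootstrap address is a placeholder, the producer has not sent anything yet (so most values remain at their defaults), and exact metric names can differ slightly across client versions (e.g., bufferpool-wait-time vs. bufferpool-wait-time-total).

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ProducerMetricsProbe {
    // The send-side metrics discussed above; names may vary slightly by client version.
    private static final Set<String> WATCHED = Set.of(
            "request-latency-avg", "request-latency-max",
            "batch-size-avg", "bufferpool-wait-time-total",
            "record-send-rate", "records-per-request-avg",
            "record-error-rate", "record-retry-rate");

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-dc1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
                if (WATCHED.contains(entry.getKey().name())) {
                    System.out.printf("%-28s %s%n", entry.getKey().name(),
                            entry.getValue().metricValue());
                }
            }
        }
    }
}
```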
records-consumed-rate / bytes-consumed-rate: These metrics confirmed that consumers were keeping pace with the broker, ensuring that downstream lag was not masking upstream improvements.
fetch-latency-avg / fetch-latency-max: Measuring fetch latencies at the consumer side was crucial for validating that end-to-end latency goals were being met, not just producer-to-broker.
records-lag-max: Highlighted any lag in consumption, which could otherwise introduce hidden tail delays.
commit-latency-avg / rebalance-latency-max: These metrics tracked the time spent in offset commits and group rebalances, often the source of rare, high p99 delays.
CPU Utilization and Load Average: High CPU usage or load could indicate thread starvation or resource contention, directly impacting broker and client performance.
Memory Usage and Swappiness: Monitoring memory consumption and swap activity (with vm.swappiness set to 1) helped ensure predictable performance and avoid latency spikes due to paging.
Disk I/O and Latency: Disk throughput and latency metrics were essential for detecting storage bottlenecks, especially during log segment rolls or under heavy ingestion.
Network Throughput and Errors: Network metrics ensured that data center links and broker interconnects were not introducing unpredictable delays or packet loss.
By focusing on this core set of metrics, performance tuning became a scientific, iterative process. Each metric provided a lens into a specific part of the pipeline, enabling targeted adjustments and validation of every configuration, hardware, or architectural change.
linger.ms = 1: Adds up to 1 ms of delay to allow more records to batch together, slightly increasing latency but improving throughput and compression efficiency.
batch.size = 32,000: Sets the maximum batch size to 32,000 bytes, allowing larger batches for better throughput and compression at the cost of more memory per partition.
compression.type = lz4: Enables LZ4 compression for batches, reducing network usage and storage at the cost of some CPU overhead; LZ4 is fast and efficient.
max.in.flight.requests.per.connection = 1: Allows only one unacknowledged request at a time, ensuring strict message ordering and supporting idempotence.
enable.idempotence = true: Prevents duplicate messages from producer retries, providing exactly-once delivery semantics from the producer to the partition log.
acks = all: Waits for all in-sync replicas to acknowledge a message before considering it sent, providing the highest durability guarantee.
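Taken together, the producer settings above correspond to a client configuration along the following lines. This is a sketch: the bootstrap address, serializers, topic name, and payload are placeholders, while the tuning values mirror the list above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-dc1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);

        // Values from the tuning exercise described above.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 1);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32_000);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = new byte[4 * 1024]; // sub-5 KB message, as in the benchmark
            producer.send(new ProducerRecord<>("trades", payload)); // hypothetical topic
            producer.flush();
        }
    }
}
```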
max.poll.records = 10,000: Allows the consumer to fetch up to 10,000 records in a single poll, increasing batch processing efficiency.
max.poll.interval.ms = 200: Sets the maximum allowed time (200 ms) between poll calls before the consumer is considered dead and removed from the group.
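The consumer settings translate to a similarly small configuration; the sketch below uses placeholder values for the bootstrap address, group ID, deserializers, and topic name. Note that with max.poll.interval.ms at 200 ms, the processing loop must return to poll() within that window, as described above.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class TunedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-dc1:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "latency-validation");        // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

        // Values from the tuning exercise described above.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10_000);
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 200);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("trades")); // hypothetical topic
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
            System.out.println("Fetched " + records.count() + " records");
        }
    }
}
```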
num.network.threads = 48: Increases the number of threads handling network requests, allowing the broker to process more client connections and network traffic in parallel, which can improve throughput for high-connection workloads.
num.io.threads = 96: Allocates more threads for disk I/O operations (such as reading/writing data and log segments), enabling higher parallelism for disk-bound tasks and potentially improving performance under heavy load.
num.replica.fetchers = 8: Sets the number of threads used by the broker to fetch data from leader partitions for replication. More fetchers can speed up replication and reduce lag between replicas, especially in clusters with many partitions.
log.segment.bytes tuned upward from 1 GB: Increasing the log segment size reduces the frequency of segment rollovers and file creation, which can minimize blocking and improve write throughput.
Heap size 64GB: Allocates a large Java heap for the broker, allowing it to cache more data and handle larger workloads.
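On recent Kafka and Confluent Platform versions, the broker thread-pool settings above are dynamically updatable, so a sketch like the one below could apply them cluster-wide without a rolling restart; heap size is a JVM option, and the log.segment.bytes change (whose final value isn't listed above) would typically go through server.properties or topic configuration instead. The bootstrap address is a placeholder, and which configs are dynamic should be confirmed for your broker version before relying on this pattern.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class BrokerThreadPoolTuning {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-dc1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // An empty broker name targets the cluster-wide dynamic default.
            ConfigResource cluster = new ConfigResource(ConfigResource.Type.BROKER, "");
            Collection<AlterConfigOp> ops = List.of(
                    new AlterConfigOp(new ConfigEntry("num.network.threads", "48"),
                            AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("num.io.threads", "96"),
                            AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("num.replica.fetchers", "8"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(cluster, ops)).all().get();
        }
    }
}
```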
To address p99 latency spikes caused by JVM garbage collection and disk I/O contention observed during initial benchmarks, the following upgrades were implemented:
Confluent Platform version 7.8
Upgraded bare metal: 56 cores, 1 TB RAM, and 2x3 TB enterprise SSDs in a RAID 10 configuration, yielding reduced jitter and tighter tail latency under extreme throughput.
XFS file system: High-performance, designed for scalability and parallelism. Efficiently handles large files, supports high IOPS workloads, and enables fast parallel I/O operations. Particularly effective for Kafka due to minimized file system overhead and reduced locking contention.
vm.swappiness = 1: Minimizes the Linux kernel’s tendency to swap application memory to disk, helping ensure that Kafka and other critical processes keep their data in RAM for better performance and lower latency.
Generational ZGC (Z Garbage Collector): Provides much lower and more predictable pause times, even with large heap sizes. This reduces latency spikes and helps Kafka brokers maintain high throughput and responsiveness, especially under heavy load or with large memory allocations.
JDK 21: Virtual threads and other Java enhancements, including garbage collector and JIT compiler improvements, boost throughput and reduce latency for concurrent applications.
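As a quick sanity check that a broker or client JVM is actually running the intended collector, and to watch cumulative collection counts and times, the standard GarbageCollectorMXBean interface can be polled. This is a generic sketch rather than anything specific to this deployment; bean names vary by collector and JDK version (generational ZGC reports separate minor and major beans).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
    public static void main(String[] args) {
        // With generational ZGC on JDK 21 (-XX:+UseZGC -XX:+ZGenerational), the beans
        // listed here reflect ZGC's minor/major cycles and pauses.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-24s collections=%d totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```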
Log in to the Confluent Support portal and learn more in the article, “What Are Possible Kafka Broker Performance Diagnostics Techniques?”
Through extensive testing involving thousands of permutations, we identified the specific non-default parameter values that significantly enhanced performance for this particular deployment and use case; those key settings are the ones outlined in this blog post. Tests were executed for durations ranging from 10 to 60 minutes, while stress tests were conducted over a full 24-hour period.
Our approach began by optimizing producer and consumer configurations, followed by fine-tuning broker settings. These initial adjustments successfully met our latency requirements, though occasional spikes still occurred. Our focus then shifted to the underlying infrastructure, which proved crucial in eliminating anomalies and delivering consistent, repeatable performance.
To truly capture the system's behavior before and after tuning, we standardized our benchmarking on a specific set of "Golden Signals" and configuration parameters. Rather than relying on simple averages, we meticulously tracked the following for every test iteration:
Granular Latency Percentiles: We moved beyond p95, recording end-to-end (E2E) latency at p50, p75, p95, p99, p99.9, and even p99.99. This spectrum was vital for exposing the "tail" events that define outlier performance.
Durability & Safety: We locked in acks=all, min.insync.replicas, and enable.idempotence to guarantee that speed never came at the cost of data safety.
Producer & Consumer Tuning: We isolated the impact of specific client-side settings, specifically tracking changes to linger.ms, batch.size, compression.type, max.poll.records, max.poll.interval.ms, and auto.commit.interval.ms.
Topology & Workload: Each run explicitly defined the physical constraints, including partition count, broker count, and producer rate, ensuring we could correlate throughput pressure with latency degradation.
Thanks to fellow team members like Kyle Moynihan and Phoebe Bakanas, we were able to meet the customer’s needs and establish the following best practices for tuning order-preserving, high-volume Kafka workloads:
Request Timing Anatomy Unlocks True Diagnosis: While standard metrics reveal system health, dissecting request timing (e.g., RemoteTimeMs) uniquely enables diagnosis of distributed state problems—especially for acks=all deployments where ISR lag is the hidden enemy of p99 latency.
Single-Partition Baseline Is Non-Negotiable: Attempting to scale out before understanding the single-partition latency floor introduces confounding variables that make true p99 optimization impossible. Start minimal, validate order and latency, then scale with confidence.
Infrastructure Is the Ultimate Bottleneck: No amount of software tuning can compensate for disk, network, or GC-induced tail events. Upgrading to enterprise SSDs, XFS, and ZGC was essential for eliminating rare, unpredictable latency spikes.
Scientific, Iterative Tuning Is Essential: Every change—hardware, JVM, configuration—must be validated by its impact on latency percentiles, not just averages. OMB and disciplined monitoring made this possible.
Reproducibility Is Power: Open, scenario-driven benchmarks tailored to the specific use case and deployment are the antidote to “point-in-time” test fallacies. They ensure every config or hardware iteration is benchmarked for true impact, not just anecdotal improvement.
See how you can apply these lessons for your own low-latency use case and environment—download the “Optimizing Your Apache Kafka Deployment” white paper to learn more.
Apache®, Apache Kafka®, Kafka®, and the Kafka logo are registered trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by the Apache Software Foundation is implied by using these marks. All other trademarks are the property of their respective owners.