
How a Tier‑1 Bank Tuned Apache Kafka® for Ultra‑Low‑Latency Trading

Written by Arvind Rajagopal and Martin Morcate Trujillo

A global investment bank partnered with Confluent to achieve sub-5 ms p99 end-to-end latency for critical trading pipelines with strict durability requirements. Through a meticulous combination of architectural discipline, in-depth monitoring, and nuanced configuration of brokers, producers, and consumers, the teams not only met their initial goal of 100,000 messages/sec but ultimately sustained this ultra-low latency at an astounding 1.6 million messages/sec with messages smaller than 5 KB.

This achievement underscores the importance of strategic architectural choices, a monitoring-first philosophy, and the advanced diagnostic techniques required for success in high-stakes Apache Kafka® environments. The project established durable, high-throughput pipelines with robust ordering guarantees and predictable tail performance, providing pivotal lessons for anyone seeking order-preserving, high-volume streaming with Kafka at scale.

Want to learn more about Kafka fundamentals to understand end-to-end latency in Kafka deployments before diving into this case study? Check out the blog post, “99th Percentile Latency at Scale With Apache Kafka,” or download the white paper “Optimizing Your Apache Kafka Deployment.”

The Millisecond Race With Apache Kafka®

In the demanding world of global capital markets, a single millisecond can dictate significant opportunities or risks in real-time trading. For the largest financial institutions, platform engineering is driven not by average latency, but by rare latency outliers. While most tuning projects focus on median (p50) or 95th percentile (p95) latency, a major investment bank's modernization initiative set a more ambitious standard: sub-5 ms p99 end-to-end latency in a multi-data center deployment. This demanded end-to-end delivery guarantees, strict order preservation, and full disaster recovery readiness at scale.

Achieving this stringent goal required a comprehensive approach that extended beyond producer-only optimization. The team meticulously instrumented every stage of the Kafka message path, effectively identifying and mitigating sources of "tail latency" (e.g., resource bottlenecks, inefficient consumer configurations, unexpected producer traffic peaks, data skew across partitions, JVM garbage collection pauses, etc.) that would typically remain undetected. This rigorous process ultimately redefined the capabilities of open source streaming for mission-critical workloads.

Deconstructing the Millisecond: A “Lowest Baseline First” Methodology

Unlike most Kafka tuning stories—often centered on producer performance in a single data center setup—this initiative focused on the end-to-end message latency in a multi-data center deployment. This setup called for deep collaboration between Confluent experts and the customer’s engineering and infrastructure teams. Using both JMX monitoring and infrastructure monitoring gave everyone involved visibility into where time was spent and where bottlenecks emerged. Together, we scrutinized every layer of the pipeline—producer, broker, and infrastructure—to diagnose latency bottlenecks and tune configurations for optimal results.

The initial test configuration was critical, utilizing a single Kafka partition with four replicas. Leveraging the multi-region cluster’s topic placement constraints, the team explicitly placed replicas across two data centers—a replica and an observer in each data center. With a measured ping latency of approximately 0.3–0.4 ms between these sites, the team established an immutable physical “floor” for the project; any duration beyond this baseline represented overhead to be optimized. This design minimized the number of required in-sync replicas, facilitating both low latency and quick failover during disaster recovery.

Latency Test Configuration With Four Replicas of a Single Kafka Partition.

The deliberate focus on a single partition preserved strict message order, which is essential for financial event integrity, while letting the team isolate and measure latency limits without the complexities introduced by partition parallelism.

The broader cluster was provisioned with eight brokers, creating a robust environment for future scalability tests. However, by channeling benchmark traffic exclusively through a single topic partition with four replicas, only four brokers were actively involved in the critical path for low-latency validation. This minimal setup embodied a “lowest baseline first” methodology, enabling meticulous scrutiny of every intermediate source of delay. Once the lowest achievable latency was validated in this controlled environment, the architecture could be confidently expanded—adding partitions to meet business workload needs while retaining the insights around scaling effects.
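
To make the replica layout concrete, the sketch below shows how such a test topic could be created using Confluent Server's confluent.placement.constraints topic configuration, which multi-region clusters use for explicit replica and observer placement. This is a minimal illustration under assumed settings, not the bank's actual tooling: the topic name, bootstrap address, and rack IDs (dc-east, dc-west) are placeholders, and the placement JSON should be checked against the documentation for the Confluent Platform version in use.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class CreateLatencyTestTopic {
    public static void main(String[] args) throws Exception {
        // One sync replica plus one observer in each data center: four replicas in total,
        // mirroring the single-partition test topic described above.
        String placement = """
            {"version": 1,
             "replicas":  [{"count": 1, "constraints": {"rack": "dc-east"}},
                           {"count": 1, "constraints": {"rack": "dc-west"}}],
             "observers": [{"count": 1, "constraints": {"rack": "dc-east"}},
                           {"count": 1, "constraints": {"rack": "dc-west"}}]}
            """;

        try (Admin admin = Admin.create(Map.<String, Object>of("bootstrap.servers", "broker-1:9092"))) {
            // Replication factor is left unset; the placement constraints determine it.
            NewTopic topic = new NewTopic("latency-test", Optional.of(1), Optional.<Short>empty())
                    .configs(Map.of("confluent.placement.constraints", placement));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```

With roughly 0.3–0.4 ms of ping latency between the two sites, this layout keeps the acks=all replication path as short as the physics allows while leaving an observer in each site to support failover.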

The hardest part was optimizing for p99 latency. By definition, the 99th percentile captures the rarest and largest outliers—moments when broker, disk, or network hiccups can create unpredictable latency. While p95 tuning eliminates most common delays, it does not protect mission-critical financial workflows from those tail events. Thus, configuration changes, hardware upgrades, and monitoring protocols were evaluated not just for average improvements but also for their ability to flatten the latency curve deep into the distribution.

Using the OpenMessaging Benchmark to Generate Reproducible Latency Test Results

A pivotal tool throughout was the OpenMessaging Benchmark (OMB), an open source test harness built for scalable, reproducible workload generation and latency/throughput observability. OMB orchestrated real-world traffic patterns, captured high-fidelity latency histograms, and produced actionable, reproducible insights into every aspect of Kafka performance. OMB's modular framework streamlined testing across configuration changes, infrastructure changes, and upgrades.

For readers interested in replicating or extending these performance tests, OMB is openly documented and supported by a large community. Reference documentation and code can be found at the OpenMessaging Benchmark GitHub repository, and detailed installation and usage guides are available on the repo’s Wiki and README, making it a robust choice for comparative and long-term performance engineering.

The architecture was iteratively improved, not through a single overhaul, but rather by incremental enhancements. These allowed for continuous optimization and adaptation, building on lessons learned to create a robust, efficient, ultra-low latency trading pipeline. This methodical approach ensured system stability and performance as each change was carefully evaluated and integrated.


Implementation and Tuning Details for Ultra-Low Latency

Early OMB benchmarks uncovered classic distributed system headaches under load: offset commit failures, sporadic record expirations, group rebalance anomalies, and latency outliers. The teams followed an iterative loop—measure, diagnose, optimize—with each cycle underpinned by cross-layer metrics.

The five major components of end-to-end Kafka latency are produce-time, publish-time, commit-time, catch-up-time, and fetch-time.

Below, we classify and explain the most important metrics that proved indispensable during this engagement.

Summary of Kafka Observability Metrics Used to Tune for Ultra-Low Latency

| Type of Metrics | Insight Gained | Metrics Used | Adjustments Made |
| --- | --- | --- | --- |
| Broker metrics | Pinpointing cluster bottlenecks | MessagesInPerSec / BytesInPerSec / BytesOutPerSec; RequestsPerSec; RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs; UnderReplicatedPartitions and ISR health | Broker configuration: num.network.threads = 48; num.io.threads = 96; num.replica.fetchers = 8; log.segment.bytes tuned upward from 1 GB; 64 GB heap |
| Producer metrics | Exposing send-side latency and batching | request-latency-avg / request-latency-max; batch-size-avg and bufferpool-wait-time; record-send-rate and records-per-request-avg; record-error-rate and record-retry-rate | Producer configuration: linger.ms = 1; batch.size = 32,000; compression.type = lz4; max.in.flight.requests.per.connection = 1; enable.idempotence = true; acks = all |
| Consumer metrics | Ensuring true end-to-end delivery | records-consumed-rate / bytes-consumed-rate; fetch-latency-avg / fetch-latency-max; records-lag-max; commit-latency-avg / rebalance-latency-max | Consumer configuration: max.poll.records = 10,000; max.poll.interval.ms = 200 |
| Host metrics | Validating infrastructure health | CPU utilization and load average; memory usage and swappiness; disk I/O and latency; network throughput and errors | Infrastructure upgrades: Confluent Platform 7.8; upgraded bare metal; XFS file system; vm.swappiness = 1; Generational ZGC (Z Garbage Collector); JDK 21 |

Broker Metrics: Pinpointing Cluster Bottlenecks

  • MessagesInPerSec / BytesInPerSec / BytesOutPerSec: These throughput metrics revealed real-time load on each broker, helping us correlate spikes in traffic with latency anomalies and resource saturation.

  • RequestsPerSec: Provided insight into the rate of handled requests, segmented by type (Produce, FetchConsumer, FetchFollower), which is closely linked to CPU and network utilization.

  • RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs: This suite of request timing metrics allowed us to break down the lifecycle of each Kafka request. Notably, RemoteTimeMs was critical for identifying delays caused by slow in-sync replicas (ISRs) when using acks=all, while RequestQueueTimeMs and ResponseQueueTimeMs flagged thread or CPU contention. (A minimal JMX poll of these request-timing metrics appears after this list.)

  • UnderReplicatedPartitions and ISR Health: Monitoring the number of under-replicated partitions and the health of the ISR set was essential for both durability and latency. Any lagging broker in the ISR could inflate p99 latency, making these metrics vital for proactive remediation.

  • RequestQueueSize / ResponseQueueSize: Indicated the depth of queued requests and responses, helping to spot resource exhaustion or thread bottlenecks.

  • NetworkProcessorAvgIdlePercent / RequestHandlerAvgIdlePercent: These idleness metrics helped identify when brokers were running at or near capacity, which could lead to increased queuing and latency.
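
To show how these broker metrics can be pulled even without a full monitoring stack, the sketch below polls the p99 of the produce request-timing MBeans over JMX. It is a minimal example under assumed settings: the broker host name and JMX port are placeholders, and it assumes remote JMX is enabled on the brokers.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerRequestTimingProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder host and port; assumes the broker exposes remote JMX.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-1.example.internal:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            String[] stages = {"RequestQueueTimeMs", "LocalTimeMs", "RemoteTimeMs",
                               "ResponseQueueTimeMs", "ResponseSendTimeMs"};
            for (String stage : stages) {
                ObjectName mbean = new ObjectName(
                        "kafka.network:type=RequestMetrics,name=" + stage + ",request=Produce");
                // Kafka's histogram MBeans expose attributes such as Mean, Max, and 99thPercentile.
                Object p99 = mbs.getAttribute(mbean, "99thPercentile");
                System.out.printf("Produce %s p99 = %s ms%n", stage, p99);
            }
        }
    }
}
```

In the engagement itself, these values were consumed through the JMX-based monitoring described earlier; the probe simply illustrates which MBeans carry the request-timing breakdown.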

Producer Metrics: Exposing Send-Side Latency and Batching

  • request-latency-avg / request-latency-max: These metrics quantified both average and worst-case producer request times, directly surfacing the impact of configuration changes and cluster health on end-to-end latency. (They can also be sampled programmatically, as shown in the sketch after this list.)

  • batch-size-avg and bufferpool-wait-time: By tracking batch sizes and buffer pool utilization, we could fine-tune batching parameters for optimal throughput without introducing queuing delays.

  • record-send-rate and records-per-request-avg: Provided a pulse on production rates and batching effectiveness.

  • record-error-rate and record-retry-rate: Flagged any delivery issues or retries that could cascade into higher tail latency.
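
The producer exposes these metrics programmatically as well as over JMX. The sketch below samples the ones named above from a running KafkaProducer instance; it is illustrative only, and exact metric names can vary slightly between client versions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

import java.util.Map;
import java.util.Set;

public final class ProducerMetricSampler {
    // Client-side metrics highlighted in this post (names as listed above).
    private static final Set<String> WATCHED = Set.of(
            "request-latency-avg", "request-latency-max",
            "batch-size-avg", "bufferpool-wait-time",
            "record-send-rate", "records-per-request-avg",
            "record-error-rate", "record-retry-rate");

    public static void log(KafkaProducer<String, byte[]> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            if (WATCHED.contains(entry.getKey().name())) {
                System.out.printf("%s (%s) = %s%n",
                        entry.getKey().name(), entry.getKey().group(), entry.getValue().metricValue());
            }
        }
    }
}
```

The same pattern applies on the consumer side via KafkaConsumer.metrics(), which exposes the fetch, lag, commit, and rebalance metrics covered next.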

Consumer Metrics: Ensuring True End-to-End Delivery

  • records-consumed-rate / bytes-consumed-rate: These metrics confirmed that consumers were keeping pace with the broker, ensuring that downstream lag was not masking upstream improvements.

  • fetch-latency-avg / fetch-latency-max: Measuring fetch latencies at the consumer side was crucial for validating that end-to-end latency goals were being met, not just producer-to-broker.

  • records-lag-max: Highlighted any lag in consumption, which could otherwise introduce hidden tail delays.

  • commit-latency-avg / rebalance-latency-max: These metrics tracked the time spent in offset commits and group rebalances, often the source of rare, high p99 delays.

Host Metrics: Validating Infrastructure Health

  • CPU Utilization and Load Average: High CPU usage or load could indicate thread starvation or resource contention, directly impacting broker and client performance.

  • Memory Usage and Swappiness: Monitoring memory consumption and swap activity (with vm.swappiness set to 1) helped ensure predictable performance and avoid latency spikes due to paging.

  • Disk I/O and Latency: Disk throughput and latency metrics were essential for detecting storage bottlenecks, especially during log segment rolls or under heavy ingestion.

  • Network Throughput and Errors: Network metrics ensured that data center links and broker interconnects were not introducing unpredictable delays or packet loss.

Why Kafka Broker, Producer, Consumer, and Infrastructure Metrics Matter

By focusing on this core set of metrics, performance tuning became a scientific, iterative process. Each metric provided a lens into a specific part of the pipeline, enabling targeted adjustments and validation of every configuration, hardware, or architectural change.

Kafka Producer Tuning

  • linger.ms = 1: Adds up to 1 ms of delay to allow more records to batch together, slightly increasing latency but improving throughput and compression efficiency

  • batch.size = 32,000: Sets the maximum batch size to 32,000 bytes, allowing larger batches for better throughput and compression, but using more memory per partition

  • compression.type = lz4: Enables LZ4 compression for batches, reducing network usage and storage at the cost of some CPU overhead; LZ4 is fast and efficient.

  • max.in.flight.requests.per.connection = 1: Only one unacknowledged request is sent at a time, ensuring strict message ordering and supporting idempotence

  • enable.idempotence=true: Prevents duplicate writes caused by producer retries, providing exactly-once delivery semantics per partition for a given producer session

  • acks=all: Waits for all in-sync replicas to acknowledge a message before considering the write successful, providing the strongest durability guarantee (the full client configuration is sketched below)
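
Put together, those settings correspond to a producer configuration along the following lines. This is a sketch for illustration: the bootstrap address and serializers are assumptions, not details from the engagement.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public final class LowLatencyProducerFactory {
    public static KafkaProducer<String, byte[]> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Batching and compression: tiny linger window, 32,000-byte batches, LZ4.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "1");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "32000");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        // Ordering and durability: one in-flight request, idempotence, acks from all ISRs.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        return new KafkaProducer<>(props);
    }
}
```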

Kafka Consumer Tuning

  • max.poll.records = 10,000: Allows the consumer to fetch up to 10,000 records in a single poll, increasing batch processing efficiency.

  • max.poll.interval.ms = 200: Sets the maximum allowed time (200 ms) between poll calls before the consumer is considered failed and removed from the group (both settings appear in the sketch below)
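
The corresponding consumer configuration is sketched below; the group ID, bootstrap address, and deserializers are placeholders for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public final class LowLatencyConsumerFactory {
    public static KafkaConsumer<String, byte[]> create() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "latency-test-consumers"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        // Large polls keep fetching efficient; the tight poll interval matches the value
        // cited in this post and assumes per-record processing stays very fast.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10000");
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "200");

        return new KafkaConsumer<>(props);
    }
}
```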

Kafka Broker Configuration Highlights

  • num.network.threads = 48: Increases the number of threads handling network requests, allowing the broker to process more client connections and network traffic in parallel, which can improve throughput for high-connection workloads (see the configuration sketch after this list).

  • num.io.threads = 96: Allocates more threads for disk I/O operations (such as reading/writing data and log segments), enabling higher parallelism for disk-bound tasks and potentially improving performance under heavy load.

  • num.replica.fetchers = 8: Sets the number of threads used by the broker to fetch data from leader partitions for replication. More fetchers can speed up replication and reduce lag between replicas, especially in clusters with many partitions.

  • log.segment.bytes tuned upward from 1GB: Increasing the log segment size reduces the frequency of segment rollovers and file creation, which can minimize blocking and improve write throughput.

  • Heap size 64GB: Allocates a large Java heap for the broker, allowing it to cache more data and handle larger workloads.
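
One way to apply the thread-pool settings above without a rolling restart is as cluster-wide dynamic broker configs through the Admin API, as sketched below; they can equally be set in server.properties. This is a hedged illustration: the bootstrap address is a placeholder, it assumes your broker version supports dynamic updates for these settings, and log.segment.bytes, the 64 GB heap, and GC flags are broker-file/JVM settings outside this snippet.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;

public class ApplyBrokerThreadTuning {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.<String, Object>of("bootstrap.servers", "broker-1:9092"))) {
            // An empty resource name targets the cluster-wide broker default.
            ConfigResource allBrokers = new ConfigResource(ConfigResource.Type.BROKER, "");
            Collection<AlterConfigOp> ops = List.of(
                    new AlterConfigOp(new ConfigEntry("num.network.threads", "48"), AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("num.io.threads", "96"), AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("num.replica.fetchers", "8"), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(allBrokers, ops)).all().get();
        }
    }
}
```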

Kafka Infrastructure

To address the p99 latency spikes caused by JVM garbage collection and disk I/O contention observed during initial benchmarks, the following upgrades were implemented:

  • Confluent Platform version 7.8

  • Upgraded bare metal: 56 cores, 1 TB RAM, and 2x3 TB enterprise SSDs in a RAID 10 configuration, yielding reduced jitter and tighter tail latency under extreme throughput.

  • XFS file system: High-performance, designed for scalability and parallelism. Efficiently handles large files, supports high IOPS workloads, and enables fast parallel I/O operations. Particularly effective for Kafka due to minimized file system overhead and reduced locking contention.

  • vm.swappiness = 1: Minimizes the Linux kernel’s tendency to swap application memory to disk, helping ensure that Kafka and other critical processes keep their data in RAM for better performance and lower latency.

  • Generational ZGC (Z Garbage Collector): Provides much lower and more predictable pause times, even with large heap sizes. This reduces latency spikes and helps Kafka brokers maintain high throughput and responsiveness, especially under heavy load or with large memory allocations.

  • JDK 21: Virtual threads and other Java enhancements, including garbage collector and JIT compiler improvements, boost throughput and reduce latency for concurrent applications.

Log in to the Confluent Support portal and learn more in the article, “What Are Possible Kafka Broker Performance Diagnostics Techniques?”

Results and Impact – Real-Time Trading With p99 Latency

Through extensive testing involving thousands of permutations, we identified specific non-default parameter values that significantly enhanced performance for this particular deployment and use case; those are the key settings highlighted in this blog post. Tests were executed for durations ranging from 10 to 60 minutes, while stress tests were conducted over a full 24-hour period.

Our approach began by optimizing producer and consumer configurations, followed by fine-tuning broker settings. These initial adjustments successfully met our latency requirements, though occasional spikes still occurred. Our focus then shifted to the underlying infrastructure, which proved crucial in eliminating anomalies and delivering consistent, repeatable performance.

To truly capture the system's behavior before and after tuning, we standardized our benchmarking on a specific set of "Golden Signals" and configuration parameters. Rather than relying on simple averages, we meticulously tracked the following for every test iteration:

  • Granular Latency Percentiles: We moved beyond p95, recording end-to-end (E2E) latency at p50, p75, p95, p99, p99.9, and even p99.99. This spectrum was vital for exposing the "tail" events that define outlier performance. (A minimal percentile-tracking sketch follows this list.)

  • Durability & Safety: We locked in acks=all, min.insync.replicas, and enable.idempotence to guarantee that speed never came at the cost of data safety.

  • Producer & Consumer Tuning: We isolated the impact of specific client-side settings, specifically tracking changes to linger.ms, batch.size, compression.type, max.poll.records, max.poll.interval.ms, and auto.commit.interval.ms.

  • Topology & Workload: Each run explicitly defined the physical constraints, including partition count, broker count, and producer rate, ensuring we could correlate throughput pressure with latency degradation.

Lessons Learned and Best Practices for Achieving p99 Latency With Kafka

Thanks to fellow team members like Kyle Moynihan and Phoebe Bakanas, we were able to meet the customer's needs and establish the following best practices for tuning order-preserving, high-volume Kafka workloads:

  • Request Timing Anatomy Unlocks True Diagnosis: While standard metrics reveal system health, dissecting request timing (e.g., RemoteTimeMs) uniquely enables diagnosis of distributed state problems—especially for acks=all deployments where ISR lag is the hidden enemy of p99 latency.

  • Single-Partition Baseline Is Non-Negotiable: Attempting to scale out before understanding the single-partition latency floor introduces confounding variables that make true p99 optimization impossible. Start minimal, validate order and latency, then scale with confidence.

  • Infrastructure Is the Ultimate Bottleneck: No amount of software tuning can compensate for disk, network, or GC-induced tail events. Upgrading to enterprise SSDs, XFS, and ZGC was essential for eliminating rare, unpredictable latency spikes.

  • Scientific, Iterative Tuning Is Essential: Every change—hardware, JVM, configuration—must be validated by its impact on latency percentiles, not just averages. OMB and disciplined monitoring made this possible.

  • Reproducibility Is Power: Open, scenario-driven benchmarks that are specific to the use case and deployment are the antidote to “point-in-time” test fallacies. They ensure every config or hardware iteration is benchmarked for true impact, not just anecdotal improvement.

See how you can apply these lessons for your own low-latency use case and environment—download the “Optimizing Your Apache Kafka Deployment” white paper to learn more.


Apache®, Apache Kafka®, Kafka®, and the Kafka logo are registered trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by the Apache Software Foundation is implied by using these marks. All other trademarks are the property of their respective owners.

  • Arvind Rajagopal is a Principal Customer Success Technical Architect at Confluent, helping enterprises design and operate event‑driven data streaming platforms that turn real‑time strategy into production outcomes. He specializes in distributed systems and high‑performance, mission‑critical applications and is drawn to hard architectural challenges and emerging technologies. Outside work, Arvind heads for the mountains and is an avid aquarist.

  • Martin Morcate Trujillo is a Senior Customer Success Technical Architect at Confluent, partnering with enterprise teams to design, optimize, and scale real-time data streaming platforms that deliver measurable business outcomes.

    Drawing on his extensive background in software development, he bridges the gap between developer and platform teams, ensuring organizations get the most value from Confluent’s distributed systems in an efficient, scalable way.

    Outside of work, Martin enjoys the sunny beaches of South Florida and loves traveling with his wife and two kids.
