A Guide to Kafka Optimizations and Benchmarks
Authors
Debashis Paul
Cloud Solutions Engineer
Roberto Baturoni
Cloud Solutions Engineer
Collaborators
David Shade
Cloud Solutions Architect
William Fowler
Cloud Solutions Architect
Jun Chen
Cloud Software Development Engineer
Sunny Wang
Cloud Software Development Engineer
Acknowledgments
Marco Carlo Changho
Performance Marketing Engineer
Padma Apparao
Principal Engineer, AI and Cloud Performance Architect
Andres Mejia
AI Software Development Engineer
Murali Madhanagopal
Cloud and AI Architect
Suleyman Sair, PhD
Principal Engineer, Cloud Software
A Guide to Kafka Optimizations and Benchmarks 2
Table of Contents
Overview
Intel Crypto for Kafka Encryption Acceleration
Optimize Apache Kafka Streaming
Performance Benchmark Test
  Workload Architecture
  Process/Methodology
  KPI
  Software Stack
1. Intel Gen-to-Gen Kafka Performance Comparison
  1.1 Intel Gen-to-Gen AWS m6i (3rd Gen) vs. m5 (2nd Gen) Encryption OFF
  1.2 Intel Gen-to-Gen AWS m6i (3rd Gen) vs. m5 (2nd Gen) Encryption ON
  1.3 Intel Gen-to-Gen AWS m6i (3rd Gen) vs. m5 (2nd Gen) with Compression
2. Kafka Performance – CPU Scaling AWS i4i
  2.1 Intel AWS i4i Instances (LZ4 compression)
3. Kafka Encryption Performance Across Java Versions
  3.1 Kafka Throughput in Intel AWS m6i (3rd Gen) vs. m5 (2nd Gen) Across JDK and Encryptions
  3.2 Kafka Throughput and Latency on Intel AWS i4i.4xlarge Across JDK
4. Kafka Compression in Intel AWS Instances
  4.1 Kafka Compression Performance Comparison AWS m6i.4xlarge
  4.2 Kafka Compression Optimizations on Intel Libraries
5. Intel’s Contributions on Open Source Optimization
  OpenJDK – Upstream/Backport Support for VAES Crypto
  Kafka Community – TLS Regression
  OpenSSL – SSL & TLS with Several Cryptographic Functions Including AES
  Intel® Storage Acceleration Library (ISA-L)
  Intel® Integrated Performance Primitives (Intel® IPP) Cryptography
6. Intel’s Continued Innovation to Optimize Apache Kafka
Appendix A – Configurations
  Table 1.1: Hardware Configuration Used for this Testing
  Table 1.2: Hardware Configuration Used for this Testing
  Table 1.3: Configurations of the Intel Bare Metal 3rd Gen Intel Xeon Scalable Processor
  Table 2.1: Software and Workload Used for this Testing and Kafka Configuration Used for this Testing
  Table 2.2: Software and Workload Used for this Testing and Kafka Configuration Used for this Testing
  Table 2.3: Software and Workload Used for this Testing and Kafka Configuration Used for this Testing
  Table 2.4: Kafka Configuration Used for this Testing
Overview
Apache Kafka is an open-source distributed event streaming platform used for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. It is a publish-subscribe real-time messaging system that processes data in a resilient, fault-tolerant, horizontally scalable way.
Because Kafka is a high-volume, low-latency message broker, it needs a fast (but still secure) encryption algorithm capable of encrypting an arbitrary amount of data. The Kafka Producer must encrypt messages before pushing them over the network, and the Kafka Consumer must decrypt them upon retrieval. Kafka supports encryption in transit using Transport Layer Security (TLS). Enabling TLS reduces performance because of the encryption overhead. Apache Kafka does not directly support any form of encryption at rest for data stored at a broker.
This handbook walks through several use cases that show how Kafka workloads can be optimized on 3rd Generation Intel Xeon Scalable CPUs, which accelerate the encryption process in hardware, across different compression methods and JDK versions.
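As a rough illustration of what enabling TLS involves on the client side, the sketch below renders a Kafka client properties file from a Python dictionary. The property keys are standard Kafka client settings; the truststore path, the password placeholder, and the choice of AES-GCM cipher suites are assumptions made for this example, not the configuration used in this testing.

```python
def render_properties(settings):
    """Render a dict as key=value lines, one setting per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Hypothetical values: replace the path and password with your own.
tls_client_settings = {
    "security.protocol": "SSL",
    "ssl.truststore.location": "/path/to/client.truststore.jks",
    "ssl.truststore.password": "<truststore-password>",
    "ssl.enabled.protocols": "TLSv1.3,TLSv1.2",
    # One AES-GCM suite for TLS 1.3 and one for TLS 1.2 (assumed choice).
    "ssl.cipher.suites": (
        "TLS_AES_256_GCM_SHA384,"
        "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
    ),
}

if __name__ == "__main__":
    print(render_properties(tls_client_settings), end="")
```

The resulting text can be saved to a file and passed to Kafka command-line tools that accept a client configuration file.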
Intel® Crypto for Kafka Encryption Acceleration
New 3rd Gen Intel Xeon Scalable processors introduce enhanced cryptographic operations, called Crypto accelerations, that improve performance for Apache Kafka workloads where encryption and decryption are enabled. The new Crypto instruction set supports stronger encryption protocols without compromising performance by reducing the compute cycles spent on cryptography processing.
As high volumes of data will be encrypted, symmetric key encryption is the natural choice for efficiently ensuring the confidentiality of stored Kafka topic messages. The Advanced Encryption Standard (AES) is an efficient symmetric encryption algorithm. 3rd Gen Intel Xeon Scalable processors (code name: Ice Lake) support Intel® AES instructions, including Vector Advanced Encryption Standard (VAES), for faster processing of cryptographic algorithms, constant-time encryption, and resilience to certain side-channel attacks. The AES-NI instruction set comprises six instructions that perform several compute-intensive parts of the AES algorithm. These instructions execute in significantly fewer clock cycles than a software solution.
Galois/Counter Mode (GCM) is an authenticated encryption mode for block ciphers. AES-GCM is not only efficient and secure, but hardware implementations can achieve high speeds with low cost and low latency because the mode can be pipelined. Applications that require high data Throughput can benefit from these high-speed implementations. AES-GCM is optimized in recent JDK releases (for example, OpenJDK 11.0.11 and later), which use the VAES and VPCLMULQDQ instructions from the Intel® AVX-512 instruction set family to accelerate Kafka streaming performance while reducing the CPU overhead of encryption. TLS uses AES-GCM cipher suites, so this acceleration improves Kafka Broker Throughput and reduces encryption overhead without impacting the latency SLA.
Intel has also developed a compression plug-in as an extension of certain compression algorithms, improving Throughput, latency, and compression ratios for Kafka workloads.
[Figure: Kafka data path showing the Producer, Lead Broker, Follower Broker, Consumer, and ZooKeeper, with Intel Crypto labels.]
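Before relying on the acceleration described above, it can help to confirm that a host actually exposes the relevant instructions. The sketch below parses the `flags` line of Linux `/proc/cpuinfo` text and reports which of `aes`, `vaes`, and `vpclmulqdq` are missing. The function names are illustrative, and the check assumes a Linux host; on other systems the file will not exist and every flag is reported as missing.

```python
REQUIRED_FLAGS = ("aes", "vaes", "vpclmulqdq")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, if any."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return set(value.split())
    return set()


def missing_crypto_flags(cpuinfo_text):
    """List the required crypto flags that the CPU does not report."""
    flags = parse_cpu_flags(cpuinfo_text)
    return [flag for flag in REQUIRED_FLAGS if flag not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as handle:
            text = handle.read()
    except OSError:
        text = ""  # Not Linux, or the file is unreadable.
    missing = missing_crypto_flags(text)
    print("missing flags:", ", ".join(missing) if missing else "none")
```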
Optimize Apache Kafka Streaming
Optimizing Kafka streaming performance is a key challenge for any enterprise that wants a better SLA, lower TCO, a better user experience, and compliance with regulatory requirements.
Below are several ways to achieve better Kafka performance on Intel® platforms:
• Switch to the latest generation Intel Xeon processor to leverage Advanced Crypto accelerations
• Scale up to Intel-based instances with more vCPUs
• Use optimized JDK versions to benefit from upstreamed Crypto features
• Use the best compression method for the use case
• Use OpenSSL TLS or advanced JDK SSL features
• Use Intel libraries for better compression, depending on the use case

Performance Benchmark Test
For capacity planning tied to SLA requirements, it is important to run benchmark tests to find the optimized Throughput and best user experience, because different environments, workloads, and use cases have specific needs.
The benchmark test cases below were performed across different Intel Xeon Scalable processor generations, JDK versions, compression methods, and encryption scenarios on AWS cloud instances. For the specific AWS instances and the related hardware, Kafka software, and workload configurations used in each test case, refer to Appendix A.

Workload Architecture
This workload measures Apache Kafka's streaming performance using the standard built-in application tools. Currently, the test case measures Apache Kafka Producer and Consumer performance. Intel used the architecture in the diagram below as the Kafka benchmarking framework for testing. The workload is executed with the standard bundled scripts ‘kafka-producer-perf-test.sh’ and ‘kafka-consumer-perf-test.sh’ as the performance harness.
[Figure: Workload architecture. A Kubernetes cluster with Node 1 hosting Apache ZooKeeper and the Apache Kafka server, Node 2 hosting Producers 1 to N, and Node 3 hosting Consumers 1 to N. Throughput is measured between the server and the Producers.]
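The two bundled scripts named above can be driven from a small wrapper. The sketch below only builds the argument lists; it does not run them. The record count, record size, batch size, linger time, message count, and timeout are taken from Appendix A, while the broker address, topic name, `bin/` path, and the unthrottled `--throughput -1` setting are assumptions for illustration.

```python
import shlex


def producer_perf_command(bootstrap, topic, num_records, record_size, compression):
    """Argument list for kafka-producer-perf-test.sh (values as strings)."""
    return [
        "bin/kafka-producer-perf-test.sh",
        "--topic", topic,
        "--num-records", str(num_records),
        "--record-size", str(record_size),
        "--throughput", "-1",  # assumed: no client-side throttling
        "--producer-props",
        f"bootstrap.servers={bootstrap}",
        f"compression.type={compression}",
        "batch.size=524288",
        "linger.ms=100",
    ]


def consumer_perf_command(bootstrap, topic, messages, timeout_ms):
    """Argument list for kafka-consumer-perf-test.sh."""
    return [
        "bin/kafka-consumer-perf-test.sh",
        "--bootstrap-server", bootstrap,
        "--topic", topic,
        "--messages", str(messages),
        "--timeout", str(timeout_ms),
    ]


if __name__ == "__main__":
    # Hypothetical broker address and topic name.
    producer = producer_perf_command("kafka-broker:9092", "perf-test", 5_000_000, 1000, "lz4")
    consumer = consumer_perf_command("kafka-broker:9092", "perf-test", 10_000_000, 600_000)
    print(shlex.join(producer))
    print(shlex.join(consumer))
```

Building the commands as lists keeps them safe to pass to `subprocess.run` without shell quoting issues.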
Process/Methodology
The benchmark was executed in an Intel internal framework with the following steps:
• Kafka Producer: publish millions of messages per thread to the Kafka Broker server
• Kafka Consumer: read millions of subscribed messages per thread
• The workload contains three Docker images:
– Producer (generates and sends messages to the Kafka and ZooKeeper server)
– Kafka-ZooKeeper-server (receives messages from the Producer and sends messages to the Consumer)
– Consumer (gets messages from the Kafka and ZooKeeper server)
• Three Kubernetes worker nodes are used for this test case to host the Producer, Broker, and Consumer containers in PODs.
• Each POD is assigned to its own Kubernetes node using an anti-affinity setup. Producer has 1 POD, Broker has 1 POD, and Consumer has 1 POD. Testing is done with Replication factor 1 with 1 partition.
• The median value of three runs is taken for Max Throughput and P99 Latency to avoid outliers.
• P99 Latency and the aggregate Throughput transmitted from Producer to Broker are measured.

Encryption ON: 33% Throughput improvement with 3rd Gen Ice Lake instances (m6i) and 12% Latency improvement vs. older generation (m5).

KPI – Key Performance Indicators
The benchmark results focus on two KPIs:
#1 – Max Throughput (in MB/second), which measures the sum of Producer messages that arrive at the Broker within a specific amount of time.
#2 – Producer P99 Latency: the time it takes for a record produced to Kafka to be fetched by the Consumer. P99 Latency is a standard tail-latency measure: the end-to-end latency that 99% of records stay at or below.

Software Stack
Any change to these configurations is called out in the individual optimization sections that override them:
Cluster: 3-node K8s cluster, 1 POD per node (N Producers, 1 Kafka Server and ZooKeeper, N Consumers)
SW: OpenJDK 8 (8u331-b09)/11.0.15/17.0.1, Kafka 3.2/3.0/2.8.1, ZooKeeper 3.7.0, Python 3.10.1
OS: Ubuntu 20.04.4 LTS (Kernel 5.13.0-11019-aws), Ubuntu 22.04.1 LTS (Kernel 5.15.0-1019-aws), CentOS Linux 7 (Core)
HW: 3rd gen Intel® Xeon® Scalable processor (Ice Lake)
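The two KPIs and the median-of-three-runs rule from the methodology can be sketched in a few lines. This is a minimal illustration rather than the Intel framework's implementation: the nearest-rank percentile method, the 1024 × 1024 bytes-per-MB convention, and the sample run values are all assumptions.

```python
import statistics


def throughput_mb_per_s(total_bytes, elapsed_s):
    """Throughput in MB/s, assuming 1 MB = 1024 * 1024 bytes."""
    return total_bytes / (1024 * 1024) / elapsed_s


def p99(latencies_ms):
    """Nearest-rank 99th percentile: the value 99% of samples stay at or below."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceiling of 0.99 * n
    return ordered[rank - 1]


def median_of_runs(values):
    """Median across repeated runs, used to damp outliers."""
    return statistics.median(values)


if __name__ == "__main__":
    # Hypothetical results from three runs of the same test case.
    run_throughputs = [410.0, 395.5, 402.3]
    print("median throughput (MB/s):", median_of_runs(run_throughputs))
    print("p99 of 1..100 ms:", p99(range(1, 101)))
```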
1. Intel Gen-to-Gen Kafka Performance Comparison
Both the Throughput and Latency KPIs depend on the choice of hardware or cloud provider, so it is important to understand which hardware acceleration and software can help you achieve your specific latency goals in your environment.
The tests below compare Kafka performance between 3rd Gen Intel Xeon Scalable processor instances and 2nd Gen Intel Xeon Scalable processor instances on Amazon AWS m5, m6i, and i4i instances. They show the performance difference across storage-, compute-, and memory-optimized AWS instances. In the sections below, any changes made to the baseline configurations are called out.

1.1: Intel Gen-to-Gen AWS m6i (3rd Gen) vs. m5 (2nd Gen) – Encryption OFF
The performance test compares the 2nd Gen Intel Xeon Scalable processor-based m5.4xl (16 vCPU) with the 3rd Gen Intel Xeon Scalable processor-based m6i.4xl (16 vCPU) on OpenJDK 11.0.15 with Kafka encryption turned off.
Performance Summary
3rd Gen Intel Xeon Scalable processor instances (m6i) show a 30% Throughput improvement vs. the older generation (m5). For Throughput, higher is better; for P99 Latency, lower is better. See Table 1.1 and Table 2.1 in Appendix A for configuration details.
Figure 1: Intel® AWS m6i (3rd Gen) vs. m5 (2nd Gen) – Encryption OFF. Chart "Gen-to-Gen m6i vs. m5 – JDK11, Encryption Off", normalized to m5.4xlarge = 1: m6i.4xlarge Max Throughput (MB/s) 1.3, P99 Latency (ms) 1.03.

1.2: Intel Gen-to-Gen AWS m6i (3rd Gen) vs. m5 (2nd Gen) – Encryption ON
The performance test compares the 2nd Gen Intel Xeon Scalable processor-based m5.4xl (16 vCPU) with the 3rd Gen Intel Xeon Scalable processor-based m6i.4xl (16 vCPU) on OpenJDK 11.0.15 with Kafka encryption turned on.
Performance Summary
3rd Gen Intel Xeon Scalable processor instances (m6i) show a 33% Throughput improvement and a 12% Latency improvement vs. the older generation (m5). The improvement comes from the Intel VAES crypto instructions in the 3rd Gen processors. For Throughput, higher is better; for P99 Latency, lower is better. See Table 1.1 and Table 2.1 in Appendix A for configuration details.
Figure 2: Intel® AWS m6i (3rd Gen) vs. m5 (2nd Gen) – Encryption ON. Chart "Gen-to-Gen m6i vs. m5 – JDK11, Encryption On", normalized to m5.4xlarge = 1: m6i.4xlarge Max Throughput (MB/s) 1.33, P99 Latency (ms) 0.88.
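The charts in this guide report values normalized to a baseline of 1, while the summaries quote percentages. The sketch below shows the arithmetic that connects the two, using the Encryption ON values of 1.33 and 0.88. The function names are illustrative.

```python
def normalize(value, baseline):
    """Express a measurement relative to its baseline (baseline becomes 1.0)."""
    return value / baseline


def throughput_gain_pct(normalized):
    """Percentage gain for a higher-is-better metric."""
    return (normalized - 1.0) * 100.0


def latency_reduction_pct(normalized):
    """Percentage reduction for a lower-is-better metric."""
    return (1.0 - normalized) * 100.0


if __name__ == "__main__":
    print(f"throughput: +{throughput_gain_pct(1.33):.0f}%")
    print(f"latency:    -{latency_reduction_pct(0.88):.0f}%")
```

A normalized Throughput of 1.33 therefore corresponds to the 33% improvement, and a normalized P99 Latency of 0.88 to the 12% improvement.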
1.3: Intel Gen-to-Gen AWS m6i (3rd Gen) vs. m5 (2nd Gen) with Compression
The performance test compares the 2nd Gen Intel Xeon Scalable processor-based m5.4xl (16 vCPU) with the 3rd Gen Intel Xeon Scalable processor-based m6i.4xl (16 vCPU) on OpenJDK 11.0.15.
Figure 3 shows the Throughput and Latency results for two compression methods, Zstd and LZ4.
Performance Summary
• The AWS m6i.4xlarge instance shows a 35% Throughput improvement with Zstd and a 30% Throughput improvement with LZ4 against the AWS m5.4xl instance.
• The AWS m6i.4xlarge instance shows a 12% Latency improvement with Zstd and a 36% Latency improvement with LZ4 against the AWS m5.4xl instance.
For Throughput, higher is better; for P99 Latency, lower is better. See Table 1.1 and Table 2.1 in Appendix A for configuration details.
Figure 3: Intel® AWS m6i (3rd Gen) vs. m5 (2nd Gen) – Zstd / LZ4 compression. Two panels, normalized to m5.4xlarge = 1. Compression Method Zstd: m6i.4xlarge Max Throughput (MB/s) 1.35, P99 Latency (ms) 0.84. Compression Method LZ4: m6i.4xlarge Max Throughput (MB/s) 1.30, P99 Latency (ms) 0.64.
2. Kafka Performance – CPU Scaling AWS i4i
This test compares Kafka performance across multiple 3rd Gen Intel Xeon Scalable processor ‘i4i’ instances in the Amazon AWS cloud. The number of Producers, Brokers, Consumers, and total partitions also increases linearly (4x) with instance size. For example, the i4i.4xlarge instance with 16 vCPUs uses 32 Brokers, Consumers, Producers, and partitions.

2.1: Intel® AWS i4i Instances (LZ4 compression)
The performance test is run on Intel Xeon Ice Lake i4i.xlarge (4 vCPU), i4i.2xlarge (8 vCPU), and i4i.4xlarge (16 vCPU) instances on OpenJDK 11.0.15. Figure 4 shows the Throughput and Latency results for LZ4 compression.
Performance Summary
• Scaling the CPUs of 3rd Gen Intel Xeon Scalable processor AWS i4i instances shows a linear Max Throughput improvement with LZ4 compression. Brokers, Consumers, and partitions are scaled along with the instances. For Throughput, higher is better; for P99 Latency, lower is better. See Table 1.2 and Table 2.2 in Appendix A for configuration details.
Figure 4: Intel® AWS i4i Instances scaling (LZ4 compression). Chart "AWS i4i Instances CPU Scaling" showing Max Throughput (MB/s, higher is better) and P99 Latency (ms, lower is better) for i4i.xlarge, i4i.2xlarge, and i4i.4xlarge, normalized to i4i.xlarge = 1.
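The scaling rule used in this test (twice as many Producers, Consumers, and partitions as vCPUs) and the idea of linear scaling can be made concrete with a short calculation. The sketch below assumes that the 3.86 label in the chart is the i4i.4xlarge Max Throughput relative to i4i.xlarge; that reading of the chart is an assumption, as are the function names.

```python
# vCPU counts per instance, from the test description.
INSTANCE_VCPUS = {"i4i.xlarge": 4, "i4i.2xlarge": 8, "i4i.4xlarge": 16}


def client_count(vcpus):
    """Producers, Consumers, and partitions are each set to twice the vCPUs."""
    return 2 * vcpus


def scaling_efficiency(speedup, resource_ratio):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup / resource_ratio


if __name__ == "__main__":
    for name, vcpus in INSTANCE_VCPUS.items():
        print(f"{name}: {vcpus} vCPUs -> {client_count(vcpus)} clients/partitions")
    ratio = INSTANCE_VCPUS["i4i.4xlarge"] / INSTANCE_VCPUS["i4i.xlarge"]
    # 3.86 is assumed to be the i4i.4xlarge throughput relative to i4i.xlarge.
    print(f"scaling efficiency: {scaling_efficiency(3.86, ratio):.1%}")
```

Under that assumption, a 4x increase in vCPUs yields about 96.5% of ideal linear Throughput scaling.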
3. Kafka Encryption Performance Across Java Versions
The transparent end-to-end encryption in Kafka, done via a Java serializer and de-serializer implementation, uses the Intel VAES Crypto instruction set. 3rd Gen Intel Xeon Scalable processor instructions support simultaneous execution of Crypto algorithm operations and parallel processing of multiple independent data buffers, which boosts Kafka stream processing performance.
The Intel team has upstreamed support for several Crypto instruction sets into OpenJDK 11.0.11+ through the open-source Java community. This section details the Kafka performance comparison across different Java versions.
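Because the acceleration depends on the JDK level, a deployment script may want to check the version before assuming it is present. The sketch below parses common Java version strings and compares them with 11.0.11, the threshold named in this guide. The parsing rules and function names are assumptions for illustration; vendor-specific version formats may need extra handling.

```python
import re

# Threshold taken from the guide: OpenJDK 11.0.11 and later.
MIN_ACCELERATED = (11, 0, 11)


def parse_java_version(version):
    """Parse strings such as '11.0.15', '17.0.1', or legacy '1.8.0_331'."""
    numbers = [int(n) for n in re.findall(r"\d+", version)]
    if len(numbers) > 1 and numbers[0] == 1:
        numbers = numbers[1:]  # legacy scheme: 1.8.0_331 means Java 8
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers[:3])


def has_crypto_acceleration(version):
    """True when the version is at or above the 11.0.11 threshold."""
    return parse_java_version(version) >= MIN_ACCELERATED


if __name__ == "__main__":
    for candidate in ("1.8.0_331", "11.0.10", "11.0.15", "17.0.1"):
        print(candidate, has_crypto_acceleration(candidate))
```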
3.1 Kafka Throughput in Intel AWS m6i (3rd Gen) vs. m5 (2nd Gen) Across JDK and Encryptions
This test compares Kafka performance between the 3rd Gen Intel Xeon Scalable processor 'm6i.4xlarge' and the 2nd Gen Intel Xeon Scalable processor 'm5.4xlarge' Amazon AWS instances on JDK 8 vs. JDK 11 with different Encryption settings. JDK 11 and higher provide Intel® Crypto acceleration, whereas JDK 8 has no such Crypto support, which accounts for the significant performance improvements.
Performance Summary
3rd Gen Intel Xeon Scalable processor AWS instances show a ~25-30% Throughput improvement against 2nd Gen Intel Xeon Scalable processors. For Throughput, higher is better; for P99 Latency, lower is better. See Table 1.1 and Table 2.1 in Appendix A for configuration details.
Figure 5: Intel® AWS m6i 3rd Gen vs. m5 2nd Gen – No compression. Chart "JDK across Encryptions – m5 Baseline Throughput", m6i.4xlarge Throughput normalized to m5.4xlarge = 1: JDK 8 – Encryption On 1.24; JDK 11 – Encryption Off 1.30; JDK 11 – Encryption On 1.33.

3.2 Kafka Throughput and Latency on Intel AWS i4i.4xlarge Across JDK
This test compares Kafka performance on 3rd Gen Intel Xeon Scalable processor 'i4i.4xlarge' Amazon AWS instances on JDK 8 vs. JDK 11 with Zstd compression and Encryption turned on. JDK 11 and higher provide Intel Crypto acceleration, whereas JDK 8 has no such Crypto support, which accounts for the significant performance improvements.
Performance Summary
The 3rd Gen Intel Xeon Scalable processor AWS instance i4i.4xlarge shows a 26% Throughput improvement and a 39% Latency improvement from JDK 8 to JDK 11 with Encryption ON. For Throughput, higher is better; for P99 Latency, lower is better. See Table 1.2 and Table 2.2 in Appendix A for configuration details.
Figure 6: Intel® AWS i4i.4xlarge Latency and Throughput for JDK 8 vs. JDK 11. Chart "JDK11 – Encryption On", normalized to JDK 8 = 1: JDK 11 Throughput 1.26, Latency 0.61.
4. Kafka Compression in Intel AWS Instances
Compression has a large effect on Kafka workload performance. By default, Kafka messages are not compressed. Compressing data batches improves Throughput, reduces the load on physical storage (even more so with replication), and reduces the data transmitted over the network. Message compression adds latency in the Producer (CPU time spent compressing the messages), so it is not always suitable for low-latency applications that cannot tolerate the cost of compression or decompression.
Producer-to-Broker Throughput with the different compression algorithms differs significantly from Throughput with no compression.
Figure 7 outlines the Throughput and Latency impact across the compression algorithms and shows how performance varies between them. Intel has also developed a plugin solution on top of 'gzip' compression to improve Latency at Max Throughput.

4.1 Kafka Compression Performance Comparison AWS m6i.4xlarge
Kafka performance across different compression algorithms is shown in Figure 7. All tests on m6i.4xlarge were run with 32 (double the vCPUs) Producers/Brokers/Consumers and partitions.
Performance Summary
P99 Latency is better with 'gzip' compression than with the other compression methods. The other compression methods all show better Throughput than 'gzip'. For Throughput, higher is better; for P99 Latency, lower is better. See Table 1.1 and Table 2.3 in Appendix A for configuration details.
Figure 7: Intel® AWS m6i.4xlarge across compression types – Encryption OFF. Chart "m6i.4xlarge – Compression Types" showing Throughput (MB/s, higher is better) and P99 Latency (lower is better) for gzip, snappy, zstd, and lz4, normalized to gzip = 1.
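Kafka Producers compress whole batches of records rather than individual messages, which is why batching and compression work well together. The sketch below illustrates the effect with Python's standard `zlib` module, which implements DEFLATE, the algorithm behind gzip. The record layout is made up for the example, and the measured ratios are not representative of the benchmark workload.

```python
import json
import zlib


def make_records(count):
    """Build small JSON records that share most of their structure."""
    return [
        json.dumps(
            {"id": i, "event": "page_view", "region": "us-east-1", "status": "ok"}
        ).encode("utf-8")
        for i in range(count)
    ]


def compression_ratio(payload):
    """Uncompressed size divided by compressed size (higher is better)."""
    return len(payload) / len(zlib.compress(payload))


if __name__ == "__main__":
    records = make_records(500)
    single = compression_ratio(records[0])
    batch = compression_ratio(b"".join(records))
    print(f"single record ratio: {single:.2f}")
    print(f"500-record batch ratio: {batch:.2f}")
```

A single short record barely compresses, while a batch of similar records compresses much better because the shared structure is encoded once.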
4.2 Kafka Compression Optimizations on Intel Libraries
Kafka compression helps achieve two things: reducing network bandwidth usage and saving disk space on Kafka Brokers. The tradeoff is dispatch latency, because compression increases CPU utilization. Gzip is known to have the highest compression ratios, with high CPU usage and the slowest compression speed (latency). In certain use cases, gzip is a more cost-effective choice than LZ4, Zstd, or Snappy.
Intel® Integrated Performance Primitives (Intel® IPP), a multithreaded software library with a Zlib interface, improves the default gzip latency. The Intel Intelligent Storage Acceleration Library (ISA-L) optimizes storage Throughput with functions for RAID, erasure coding, cyclic redundancy check (CRC), cryptographic hashing, encryption, and compression.
Figure 8 compares the Intel solutions with native Java gzip to show the performance boost.
Performance Summary
Intel® Storage Acceleration Library (Intel® ISA-L) improves Throughput by 1.47x and Intel IPP improves Throughput by 1.15x compared with Java native gzip. Intel ISA-L improves Latency by 34% and IPP improves Latency by 8% compared with Java native gzip. For Throughput, higher is better; for P99 Latency, lower is better. See Table 1.3 and Table 2.4 in Appendix A for configuration details.
Figure 8: 3rd Gen Intel® Xeon® Scalable processor with Intel Compression library. Chart "Gzip vs. Intel Lib Comparison", normalized to Java gzip = 1: Max Throughput (MB/s) IPP gzip 1.15, ISA-L 1.47; P99 Latency IPP gzip 0.91, ISA-L 0.66.
5. Intel’s Contributions on Open Source Optimization
OpenJDK – Upstream/Backport Support for VAES Crypto
The Intel team has contributed to the OpenJDK community so that Java can use the Crypto acceleration features of the Intel AVX-512 VAES (Vectorized Advanced Encryption Standard) instruction set in 3rd Gen Intel Xeon Scalable processors.
The Intel team also backported several Crypto/Hash acceleration features from later JDK versions (JDK 12+) to JDK 11 LTS and JDK 11.0.15, which improves overall Java performance in JDK 11 for numerous Java-dependent workloads, including Kafka. This enhancement was contributed by Intel and sponsored by the HotSpot compiler team.

Kafka Community – TLS Regression
Kafka supports TLS for both encryption and authentication. TLS uses AES-GCM, which can be CPU intensive. TLS 1.3 does not support renegotiation: a server that has negotiated TLS 1.3 must terminate the connection with an “unexpected message” alert if renegotiation is attempted. On Kafka 2.7, this created intermittent disconnections in Brokers before read/write completed, impacting P99 Latency.
While working with a customer, Intel engineers found the issue with JDK 11 and TLS 1.3 and suggested a fix. The customer applied the fix to resolve the issue, and the fix has been submitted upstream to the Kafka community (https://issues.apache.org/jira/browse/KAFKA-13418).

OpenSSL – SSL & TLS with Several Cryptographic Functions Including AES
The OpenSSL project provides an open-source implementation of the SSL/TLS protocols and is a commonly deployed SSL/TLS library worldwide that can be used for Kafka client and Broker communication. Confluent Kafka broadly adopted OpenSSL for TLS. An OpenSSL implementation can perform better than JDK SSL. Asynchronous OpenSSL is a non-blocking approach that supports a parallel-processing model at the cryptographic level for SSL/TLS protocols, which in turn allows for other types of optimizations. This capability allows cryptographic transformations to be processed on dedicated hardware engines or on separate logical cores. The Intel® QuickAssist Technology (Intel® QAT) Engine for OpenSSL can boost overall TLS performance.
The Intel QAT OpenSSL Engine (QAT_Engine) supports acceleration both in hardware and in optimized software based on vectorized instructions.

Intel® Storage Acceleration Library (Intel® ISA-L)
Intel ISA-L provides tools to minimize disk space use and maximize storage Throughput, security, and resilience. Intel ISA-L is a collection of optimized low-level functions targeting storage applications. Intel ISA-L helps improve compression and Throughput performance and reduce latency for a storage application with erasure coding that uses Reed-Solomon error correction. Intel ISA-L has been shown to increase gzip compression performance (better Throughput) using an Intel implementation called IGZIP. Intel ISA-L Crypto accelerates multi-buffer cryptographic hashes, providing better Throughput by using vector SIMD instruction sets and improved AES ciphers. This can optimize Kafka performance.

Intel® Integrated Performance Primitives (Intel® IPP) Cryptography
Intel IPP Cryptography is a secure, fast, and lightweight library of building blocks for cryptography, highly optimized for various Intel CPUs, including the 3rd Gen Intel Xeon Scalable processor. It takes advantage of hardware cryptography instructions across several Intel® Streaming SIMD Extensions versions and various AVX instruction sets. Intel Integrated Performance Primitives, a multi-threaded software library (part of the Intel oneAPI toolkit), shows better Kafka performance using IPP gzip, the Intel-patched version of the native Java gzip solution.

6. Intel’s Continued Innovation to Optimize Apache Kafka
• Better Kafka performance using the JDK 18 optimized CRC32 and interleaved GCM functions on Intel hardware.
• The new 4th Gen Intel Xeon Scalable processor QAT accelerator engine improves Crypto acceleration and data compression/decompression while offloading the CPU.
• JDK 18 improved Java array copy/clear to use 512-bit wide vector instructions on 4th Gen Intel Xeon Scalable processors.
• Intel Granulate (https://granulate.io/solutions/intel/), an application and workload performance optimization solution, reduces CPU utilization without any code changes. It can reduce costs by up to 60% while saving up to 25-40% of CPU.
Appendix A Configurations
The following tables show the full configuration details for the test environment, platforms, and software.
All performance results are based on these configurations and tests by Intel on April 12, 2022 - Sept. 21, 2022.
Table 1.1: Hardware Configuration Used for this Testing
| Attribute | m6i.4xlarge | m5.4xlarge |
|---|---|---|
| Manufacturer | Amazon EC2 | Amazon EC2 |
| Product Name | m6i.4xlarge | m5.4xlarge |
| BIOS Version | 1 | 1 |
| Microcode | 0xd000331 | 0x500320a |
| IRQ Balance | Enabled | Enabled |
| CPU Model | Intel® Xeon® Platinum 8375C CPU @ 2.90 GHz | Intel® Xeon® Platinum 8259CL CPU @ 2.50 GHz |
| Base Frequency | 2.9 GHz | 2.5 GHz |
| Maximum Frequency | 3.5 GHz | 3.5 GHz |
| All-Core Maximum Frequency | 3.5 GHz | 3.1 GHz |
| CPU(s) | 16 | 16 |
| Thread(s) per Core | 2 | 2 |
| Core(s) per Socket | 8 | 8 |
| Socket(s) | 1 | 1 |
| NUMA Node(s) | 1 | 1 |
| Prefetchers | DCU HW, DCU IP, L2 HW, L2 Adj. | DCU HW, DCU IP, L2 HW, L2 Adj. |
| Turbo | Enabled | Enabled |
| Frequency | 2,899 MHz | 2.5 GHz |
| Max C-State | 9 | 9 |
| Installed Memory | 64 GB (1x64 GB DDR4 3,200 MT/s [Unknown]) | 64 GB (1x64 GB DDR4 2,933 MT/s [Unknown]) |
| Huge Pages Size | 2,048 kB | 2,048 kB |
| Transparent Huge Pages | madvise | madvise |
| Automatic NUMA Balancing | Disabled | Disabled |
| NIC Summary | 1x Elastic Network Adapter (ENA) | 1x Elastic Network Adapter (ENA) |
| Drive Summary | 1x 500G Amazon Elastic Block Store | 1x 500G Amazon Elastic Block Store |
Table 1.2: Hardware Configuration Used for this Testing
| Attribute | i4i.xlarge | i4i.2xlarge | i4i.4xlarge |
|---|---|---|---|
| Manufacturer | Amazon EC2 | Amazon EC2 | Amazon EC2 |
| Product Name | i4i.xlarge | i4i.2xlarge | i4i.4xlarge |
| BIOS Version | 1 | 1 | 1 |
| Microcode | 0xd000331 | 0xd000331 | 0xd000331 |
| IRQ Balance | Enabled | Enabled | Enabled |
| CPU Model | Intel® Xeon® Platinum 8375C CPU @ 2.90 GHz | Intel® Xeon® Platinum 8375C CPU @ 2.90 GHz | Intel® Xeon® Platinum 8375C CPU @ 2.90 GHz |
| Base Frequency | 2.9 GHz | 2.9 GHz | 2.9 GHz |
| Maximum Frequency | 3.5 GHz | 3.5 GHz | 3.5 GHz |
| All-Core Maximum Frequency | 3.5 GHz | 3.5 GHz | 3.5 GHz |
| CPU(s) | 4 | 8 | 16 |
| Thread(s) per Core | 2 | 2 | 2 |
| Core(s) per Socket | 2 | 4 | 8 |
| Socket(s) | 1 | 1 | 1 |
| NUMA Node(s) | 1 | 1 | 1 |
| Prefetchers | DCU HW, DCU IP, L2 HW, L2 Adj. | DCU HW, DCU IP, L2 HW, L2 Adj. | DCU HW, DCU IP, L2 HW, L2 Adj. |
| Turbo | Enabled | Enabled | Enabled |
| Frequency | 2.9 GHz | 2.9 GHz | 2.9 GHz |
| Max C-State | 9 | 9 | 9 |
| Installed Memory | 32 GB (1x64 GB DDR4 3,200 MT/s [Unknown]) | 64 GB (1x64 GB DDR4 3,200 MT/s [Unknown]) | 128 GB (1x64 GB DDR4 3,200 MT/s [Unknown]) |
| Huge Pages Size | 2,048 kB | 2,048 kB | 2,048 kB |
| Transparent Huge Pages | madvise | madvise | madvise |
| Automatic NUMA Balancing | Disabled | Disabled | Disabled |
| NIC Summary | 1x Elastic Network Adapter (ENA) | 1x Elastic Network Adapter (ENA) | 1x Elastic Network Adapter (ENA) |
| Drive Summary | 1x 500 G Amazon Elastic Block Store | 1x 500 G Amazon Elastic Block Store | 1x 500 G Amazon Elastic Block Store, 1x 3.4T Amazon EC2 NVMe Instance Storage |
Table 1.3: Configurations of the Intel Bare metal 3rd Gen Intel® Xeon® Scalable Processor
| HW / SW | Configuration for IA Testing (ISA-L and IPP) |
|---|---|
| OS | CentOS Linux 7 (Core) |
| Kernel | 5.13.0+ |
| CPU model | ICELAKE – Intel® Xeon® Gold 6348 CPU @ 2.60 GHz |
| Sockets, Total CPU(s), NUMA Count | 2, 112, 2 |
| HT, Turbo Boost | YES, YES |
| Memory | 1,024 GB (32x32 GB DDR4 3,200 MT/s [3,200 MT/s]) |
| Disk | Nvme0n1: 3.7T |
| Network | loopback |
| BIOS Version | 05.01.01 |
| Microcode | 0xd0002a0 |
| FWVersion | 02.01.00.1127 |
| Kafka | 3.0.0 |
| Java | JDK 11.0.15 |
| ISA-L | 2.30 |
| IPP | 2021.4.0 |
Table 2.1: Software and Workload Used for this Testing
| Attribute | m6i.4xlarge | m5.4xlarge |
|---|---|---|
| OS_VER | 20.04.4 | 22.04.1 |
| OS_IMAGE | Ubuntu 22.04.1 LTS | Ubuntu 22.04.1 LTS |
| OPENJDK_VER | jdk-11.0.15 | jdk-11.0.15 |
| OPENJDK_PACKAGE | openJDK11U-jdk_x86_linux_hotspot_11.0.15_10.tar.gz | openJDK11U-jdk_x86_linux_hotspot_11.0.15_10.tar.gz |
| PYTHON_VER | Python-3.10.2 | Python-3.10.2 |
| PYTHON_PACKAGE | Python-3.10.2.tgz | Python-3.10.2.tgz |
| ZooKeeper | 3.7.0 | 3.7.0 |
| ZOOKEEPER_PACKAGE | apache-zookeeper-3.7.0-bin.tar.gz | apache-zookeeper-3.7.0-bin.tar.gz |
| KAFKA | 3.2 | 3.2 |
| KAFKA_PACKAGE | kafka_2.12-3.2.0.tgz | kafka_2.12-3.2.0.tgz |

Kafka Configuration Used for this Testing

| Parameter | Value |
|---|---|
| REPLICATION_FACTOR | 1 |
| PARTITIONS | *Twice # of vCPUs |
| # OF PRODUCERS | *Twice # of vCPUs |
| # OF CONSUMERS | *Twice # of vCPUs |
| NUM_RECORDS | 5,000,000 |
| ENCRYPTION | TRUE |
| RECORD_SIZE | 1,000 |
| COMPRESSION_TYPE | Zstd/LZ4 |
| MESSAGES | 10,000,000 |
| CONSUMER_TIMEOUT | 600,000 |
| BATCH_SIZE | 524,288 |
| LINGER_MS | 100 |
Table 2.2: Software and Workload Used for this Testing
| Attribute | i4i.xlarge, i4i.2xlarge, i4i.4xlarge |
|---|---|
| OS_VER | 22.04.1 |
| OS_IMAGE | Ubuntu 22.04.1 LTS |
| OPENJDK_VER | jdk-11.0.15 |
| OPENJDK_PACKAGE | openJDK11U-jdk_x86_linux_hotspot_11.0.15_10.tar.gz |
| PYTHON_VER | Python-3.10.2 |
| PYTHON_PACKAGE | Python-3.10.2.tgz |
| ZooKeeper | 3.7.0 |
| ZOOKEEPER_PACKAGE | apache-zookeeper-3.7.0-bin.tar.gz |
| KAFKA | 3.2 |
| KAFKA_PACKAGE | kafka_2.12-3.2.0.tgz |

Kafka Configuration Used for this Testing

| Parameter | Value |
|---|---|
| REPLICATION_FACTOR | 1 |
| PARTITIONS | *Twice # of vCPUs |
| # OF PRODUCERS | *Twice # of vCPUs |
| # OF CONSUMERS | *Twice # of vCPUs |
| NUM_RECORDS | 5,000,000 |
| ENCRYPTION | TRUE |
| RECORD_SIZE | 1,000 |
| COMPRESSION_TYPE | Zstd/LZ4 |
| MESSAGES | 10,000,000 |
| CONSUMER_TIMEOUT | 600,000 |
| BATCH_SIZE | 524,288 |
| LINGER_MS | 100 |
Table 2.3: Software and Workload Used for this Testing
| Attribute | m6i.4xlarge |
|---|---|
| OS_VER | 20.04.4 |
| OS_IMAGE | Ubuntu 20.04.4 LTS |
| OPENJDK_VER | jdk-17.0.1 |
| OPENJDK_PACKAGE | openjdk-17.0.1_linux-x64_bin.tar.gz |
| PYTHON_VER | Python-3.10.2 |
| PYTHON_PACKAGE | Python-3.10.2.tgz |
| ZooKeeper | 3.7.0 |
| ZOOKEEPER_PACKAGE | apache-zookeeper-3.7.0-bin.tar.gz |
| KAFKA | 2.8.1* |
| KAFKA_PACKAGE | kafka_2.12-2.8.1.tgz |

Kafka Configuration Used for this Testing

| Parameter | Value |
|---|---|
| REPLICATION_FACTOR | 1 |
| PARTITIONS | *Twice # of vCPUs |
| # OF PRODUCERS | *Twice # of vCPUs |
| # OF CONSUMERS | *Twice # of vCPUs |
| NUM_RECORDS | 3,000,000 |
| ENCRYPTION | OFF |
| RECORD_SIZE | 1,000 |
| COMPRESSION_TYPE | gzip/Zstd/Snappy/LZ4 |
| MESSAGES | 2,000,000 |
| CONSUMER_TIMEOUT | 600,000 |
| BATCH_SIZE | Default |
| LINGER_MS | Default |
Table 2.4: Kafka Configuration Used for this Testing

| Parameter | Value |
|---|---|
| REPLICATION_FACTOR | 1 |
| PARTITIONS | 1 |
| # OF PRODUCERS | 112 |
| # OF CONSUMERS | 1 |
| # OF BROKERS | 1 |
| NUM_RECORDS | 5,000,000 |
| MESSAGES | 10,000,000 |
| ENCRYPTION | No |
| RECORD_SIZE | 2,048 |
| COMPRESSION_TYPE | Gzip/IPP Gzip |
| BATCH_SIZE | 524,288 |
| LINGER_MS | 100 ms |
Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.
See configuration disclosure for additional details.
No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.
0922/KF/HBD/PDF Please Recycle 353150-001