Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now
This blog post is the second in a series about the Streams API of Apache Kafka, the new stream processing library of the Apache Kafka project, which was introduced in Kafka v0.10.
Current blog posts in the Streams API in Kafka series:
In this post we describe the security features of the Streams API in Kafka. Many use cases and applications — whether it is in the area of stream processing or elsewhere — have tight internal and/or external security requirements. Industries where such requirements are common include finance and healthcare in the private sector but also include governmental services. Legal compliance, for example, may require you to implement certain security measures such as encrypting data-in-transit when you are working on sensitive data such as personally identifiable information. You may also be required to enforce authentication and authorization to limit access to data to only a subset of employees or personnel to adhere to information security policies such as Need-to-know.
The question we want to answer in this blog post is:
Let’s start with a quick answer to this question:
We can now walk through this answer in further detail.
First, which security features are available in Apache Kafka, and thus in the Streams API? The Streams API in Kafka supports all the client-side security features in Apache Kafka. In this short blog post we cannot cover these client-side security features in full detail, so I recommend reading the Kafka Security chapter in the Confluent Platform documentation and our previous blog post Apache Kafka Security 101 to familiarize yourself with the security features that are currently available in Apache Kafka.
That said, let me highlight a couple of important Kafka security features that are essential for implementing robust data infrastructures, whether these are used for building horizontal services at larger companies, for multi-tenant infrastructures (e.g. microservices), or for shared platforms such as in the Internet of Things. Later on I will then demonstrate an example application where we use some of these security features in the Streams API in Kafka.
Kafka security features include:
It’s worth noting that the aforementioned security features in Apache Kafka are optional, and it is up to you to decide whether to enable or disable any of them. And you can mix and match these security features as needed: both secured and non-secured Kafka clusters are supported, as well as a mix of authenticated, unauthenticated, encrypted and non-encrypted clients. This flexibility allows you to model the security functionality in Kafka to match your specific needs, and to make effective cost vs. benefit (read: security vs. convenience/agility) tradeoffs: tighter security requirements in places where security matters (e.g. production), and relaxed requirements in other situations (e.g. development, testing).
Second, how do you use these security features in the Streams API i.e. when building your own stream processing applications? The most important aspect to understand is that the Streams API leverages the standard Kafka producer and consumer clients behind the scenes. Hence what you need to do to secure your stream processing applications is to configure the appropriate security settings of the corresponding Kafka producer/consumer clients. Once you know which client-side security features you want to use, you simply need to include the corresponding settings in the configuration of your Streams API application.
Let’s show a simple example, based our previous blog post Apache Kafka Security 101. What we want to do is to configure our Streams API application to 1. encrypt data-in-transit when communicating with its target Kafka cluster and 2. enable client authentication.
Tip: A complete demo application including step-by-step instructions is available at SecureKafkaStreamsExample.java under https://github.com/confluentinc/examples.
For the sake of brevity, we assume that a. the security setup of the Kafka brokers in the cluster is already completed and b. the necessary SSL certificates are available to your Streams API in Kafka application in the filesystem locations specified below (the aforementioned blog post walks you through the steps to generate them); for example, if you are using Docker to containerize your Streams API in Kafka applications, then you must also include these SSL certificates in the right locations within the Docker image.
Once these two assumptions are met, you must only configure the corresponding settings for the Kafka clients in your Streams API application. The configuration snippet below shows the settings to enable client authentication and enable SSL encryption for data-in-transit between your Streams API application and the Kafka cluster it is reading from and writing to:
Within a Streams API application, you’d use code such as the following to configure these settings in your StreamsConfig instance:
With these settings in place your Streams API in Kafka application will encrypt any data-in-transit that is being read from or written to Kafka, and it will also authenticate itself against the Kafka brokers that it is talking to. (Note that this simple example does not cover client authorization.)
Now what would happen if you misconfigured the security settings in your Kafka Streams application? In this case, the application would fail at runtime, right after you started it. For example, if you entered an incorrect password for the ssl.keystore.password setting, then the following error messages would be logged, and after that the application would terminate:
Similar exceptions would be thrown if you misconfigured other security settings such as ssl.key.password:
Your Operations team can monitor the log files of your Streams API applications for such error messages to spot any misconfigured applications quickly, and to alert the corresponding teams.
In summary, the Streams API in Kafka makes your stream processing applications secure, and it achieves this through its native integration with Apache Kafka’s security functionality.
If you have enjoyed this article, you might want to continue with the following resources to learn more about Apache Kafka’s Streams API:
Tableflow can seamlessly make your Kafka operational data available to your AWS analytics ecosystem with minimal effort, leveraging the capabilities of Confluent Tableflow and Amazon SageMaker Lakehouse.
Building a headless data architecture requires us to identify the work we’re already doing deep inside our data analytics plane, and shift it to the left. Learn the specifics in this blog.