Kafka is one of the most important foundation services at Zendesk. It became even more crucial with the introduction of the Global Event Bus, which my team built to propagate events between Kafka clusters hosted in different parts of the world and between different products. As part of its rollout, we had to add mTLS support to all of our Kafka clusters (we have quite a few of them) so that events propagated between those clusters travel securely. It was quite a journey, but we eventually built a solution that is working well for us.
Things I will be sharing as part of the talk:
1. Establishing the use case/problem we were trying to solve (why we needed mTLS)
2. Building a Certificate Authority with open source tools (with self-signed Root CA)
3. Building helper components that generate certificates automatically and regenerate them before they expire, for both Kafka clients and brokers (this makes a shorter TTL, or Time To Live, practical, which is good security practice)
4. Hot reloading regenerated certificates on Kafka brokers without downtime
5. What we built to rotate the self-signed root CA across the board, also without downtime
6. Monitoring and alerts on TTL of certificates
7. Performance impact of using TLS (and why TLS affects Kafka’s performance)
8. What we are doing to drive adoption of mTLS among existing Kafka clients that use the PLAINTEXT protocol, by making onboarding easier
9. How this will become a base for other features we want, e.g. ACLs and rate limiting (using the principal from the TLS certificate as the identity of clients)
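To give a flavour of items 3 and 6, here is a minimal sketch of the kind of check a renewal helper or a TTL alert could run. It is illustrative only and not our actual implementation: the function names and the two-thirds renewal threshold are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def remaining_ttl(not_after: datetime, now: datetime) -> timedelta:
    """Time left before the certificate expires (negative once expired)."""
    return not_after - now


def needs_renewal(
    not_before: datetime,
    not_after: datetime,
    now: datetime,
    renew_fraction: float = 2 / 3,  # hypothetical threshold, not a Kafka setting
) -> bool:
    """True once the given fraction of the certificate lifetime has elapsed."""
    lifetime = not_after - not_before
    renew_at = not_before + lifetime * renew_fraction
    return now >= renew_at


# Example: a certificate with a 24-hour TTL.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)

early = needs_renewal(issued, expires, issued + timedelta(hours=15))
late = needs_renewal(issued, expires, issued + timedelta(hours=17))
left = remaining_ttl(expires, issued + timedelta(hours=17))
```

Renewing well before expiry leaves room for retries if the CA is briefly unavailable, and the same remaining-TTL value can feed an alert when renewal has not happened in time.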
Takeaways I can think of now include:
1. While building this, I expected to find plenty of information about mTLS in Kafka on the public internet, but to my surprise there isn’t much (there are blogs on how to configure Kafka with TLS, but not on how to build the Public Key Infrastructure supporting it). I hope this talk will inspire others to try implementing mTLS in their Kafka clusters.
2. This will hopefully convince the audience that it’s not that hard to build a PKI from scratch using open source tools and add mTLS support to Kafka clusters, something I think many people don’t try or give much thought to.
3. Those interested in implementing something similar will come away with a pretty good overview of the entire solution, so it should be relatively easy to follow what we did and build their own.
4. Automated certificate rotation is considered hard (rotating a self-signed root CA certificate is even harder, and people mostly ignore it), so most teams resort to longer TTLs. This talk will hopefully show a way to regenerate certificates automatically for both brokers and clients without service disruption, which encourages good security practice. It will also show a semi-automated way of rotating root CA certificates without downtime.
5. I think the way we are approaching wider adoption of mTLS among existing clients using PLAINTEXT is a good one. We have tried to make it as quick and easy as possible for product engineers to implement by providing out-of-the-box tools (as the foundation team, our doing the heavy lifting means product engineers can spend more time building features for customers while still following best practices and staying secure). This might inspire others, because in many cases tools and solutions lack adoption despite adding value.