Log Compaction | Highlights in the Kafka and Stream Processing Community | March 2016

It was another productive month in the Apache Kafka community. Many of the KIPs that were under active discussion in the last Log Compaction were implemented, reviewed, and merged into Apache Kafka. Last month’s activities also included a patch release for Kafka 0.9 and the beginning of a plan for the next release.

  • Breaking news! Apache Kafka 0.9.0.1 and Confluent Platform 2.0.1 were released. These are the first patch releases for Apache Kafka 0.9 and Confluent Platform 2.0 respectively. We highly recommend upgrading Kafka clusters to this release to enjoy the large number of fixes and improvements.
  • The Apache Kafka community voted to designate the next release 0.10.0. We originally planned on having 0.9.1 as the next release, but with a protocol update, a new file format, and the much anticipated Kafka Streams processing library, we felt that a bigger change in the version number was warranted to reflect the big improvements.
  • Pull request implementing KIP-41 was merged. KIP-41 is a small but important improvement to the new consumer API. It allows consumers to limit the number of records returned when polling Kafka.
  • Pull request implementing KIP-32 was merged. KIP-32 adds a timestamp to Kafka’s message format. The timestamp allows producers to record the physical time a message was produced and allows consumers to use the timestamps when processing messages. This change also paves the way to KIP-33, which adds a timestamp-based index, allowing users to look up messages by timestamp.
  • Pull requests implementing KIP-42 were merged. KIP-42 adds client interceptors to Kafka. These interceptors, running within the producers and consumers, will let administrators “inject” code that will listen to messages that were produced (immediately before they are sent to Kafka) or to messages that are consumed (immediately before the consumer sees them) and then either record metrics and metadata regarding the messages or apply modifications to the messages. This is a very powerful feature and we are excited about possible use cases.
  • There are a few new KIPs under discussion, and you are welcome to join in:
    • KIP-47 – A proposal to use the newly added timestamps from KIP-32 and the time-based indexes that will be added in KIP-33 to allow deleting of all messages with timestamp older than a specified timestamp (for example, delete all messages written before Jan 14th, 2:59pm). This will allow the deleting of data only when the client application determines it is no longer required.
    • KIP-48 – Proposal to add support for delegation tokens to Kafka. Delegation tokens are an authentication mechanism used in Apache Hadoop, and they allow authenticating a large number of clients without increasing load on a central KDC server. Delegation tokens also help processing frameworks distribute work without dealing with distributing keytabs: an application master authenticates and distributes delegation tokens to the tasks that it starts.
  • The Kafka Connector Hub is Live! With 12 connectors and growing, Kafka Connect is becoming key to scalable data integration.
  • Microsoft surprised everyone by open sourcing their .NET Kafka client and sharing it on GitHub. This is clearly not something Jay Kreps ever expected.
  • The Kafka community in the Bay Area joined together for a meetup hosted by LinkedIn. It was great to meet new and familiar members of the community and to learn how different companies are using Kafka. LinkedIn generously shared a recording of the event, and you can also find the individual slide decks:
    • Airbnb explained how they built their processing pipeline, nicknamed Jitney.
    • LinkedIn showed how they are handling large messages.
    • SignalFx showed how they created an ultra high performance Kafka consumer.
  • Both Apache Storm and Apache Spark communities shared their visions for the future of real-time data processing. Important reading for those keeping up to date on stream processing.
  • The early release of “Kafka: The Definitive Guide” is out. Whether you are new to Kafka and looking for first steps, or an experienced user who needs production advice or instructions on how to use the new APIs, this book is right for you.
  • Remember to register for the first-ever Kafka Summit, which takes place on Tuesday, April 26th in San Francisco. Engage with experts, core committers, and leading production users of Apache Kafka, stream processing frameworks, and related projects. Attendees will learn where the Kafka project development is headed and how companies are using Kafka for large-scale, real-time data integration and stream processing in a variety of applications. There is special pricing with the Hilton hotel until March 30th, so also remember to book your room if you decide to go.
  • For those new to Kafka, consider attending the “Introduction to Apache Kafka” half-day tutorial, taking place on the Monday afternoon before the conference. Register or learn more.
  • Confluent’s Kafka training class for developers, offered in San Francisco on April 27-29 (same week as Kafka Summit), is sold out. If you’re interested in this class, you are encouraged to get on the waitlist in case of a cancellation or in the event that another class can be added.
  • Confluent’s Kafka operations training class (April 27-28, San Francisco) is also sold out. However, a second class was added and there are only a few spaces left, so register soon!
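
For readers who want a feel for how the KIP-41 and KIP-42 changes mentioned above might show up in a consumer's configuration, here is a minimal sketch. It assumes the configuration keys proposed in the KIPs (`max.poll.records` and `interceptor.classes`) are the ones that ship in the release. The broker address, the group name, and the `com.example.AuditInterceptor` class are hypothetical placeholders. The code only builds a `Properties` object and does not connect to Kafka.

```java
import java.util.Properties;

public class ConsumerConfigSketch {

    // Builds consumer settings only; nothing here talks to a broker.
    static Properties buildConsumerProps(String bootstrapServers,
                                         String groupId,
                                         int maxPollRecords) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);

        // KIP-41: cap the number of records a single poll() may return.
        props.setProperty("max.poll.records", Integer.toString(maxPollRecords));

        // KIP-42: comma-separated list of interceptor classes.
        // com.example.AuditInterceptor is a hypothetical class name.
        props.setProperty("interceptor.classes", "com.example.AuditInterceptor");

        return props;
    }

    public static void main(String[] args) {
        // Placeholder broker address and group name.
        Properties props = buildConsumerProps("localhost:9092", "demo-group", 100);
        props.list(System.out);
    }
}
```

These properties would normally be passed to the new consumer's constructor along with key and value deserializers, which are omitted here to keep the sketch free of external dependencies.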

That’s all for this month! Since the community is far too active for one person to keep track of all the activities, I’d like to thank Ismael Juma for helping to collect the activities and keep track of the status of the many ongoing KIPs.

  • Gwen Shapira is a Software Engineer at Confluent. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. She currently specialises in building real-time, reliable data processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, an author of books including “Kafka: The Definitive Guide”, and a frequent presenter at data-related conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.