Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now

ApacheCon 2015

Written By
  • Jun RaoCo-founder of Confluent and original co-creator of Apache Kafka®

I was at ApacheCon 2015 in Austin, Texas a couple of weeks ago. The following is a short summary of some of the trends that I observed at the conference.

There is a lot of usage and interest in Apache Kafka.

I gave a talk on “Introduction to Apache Kafka Tuesday afternoon. About a quarter of the attendees went to the talk and the room was full. We also hosted a Kafka meetup on Tuesday evening for conference attendees. I provided an overview of the Confluent Platform 1.0 release. In particular, I described Schema Registry and how it can be used with Kafka to captured structured feeds of stream data. Most participants seem to understand the value of having schemas and a well defined policy for schema evolution. On Wednesday morning, Todd Palino from LinkedIn gave a well received talk on Kafka at Scale: a multi-tiered architecture, in which he shared lots of valuable experience of running Kafka in production.

Tony Ng from Ebay gave a talk on Pulsar: Realtime Analytics at Scale leveraging Kafka, Hadoop and Kylin. Pulsar is an open source real time analytics and stream processing framework. Its typical usage includes reporting/dashboards, business activity/monitoring, personalization, marketing/advertising, fraud and bot detection. Currently, it’s processing about 0.5 million events/sec at Ebay. Similar to other stream processing systems, Pulsar supports getting data from different channels including Kafka, file, http, etc. It also has a pluggable processing layer. Pulsar supports SQL on stream data through Esper. One key distinguishing feature I found is that Pulsar integrates with Druid and Apache Kylin, a Hadoop based OLAP engine to build cubes.

There is a growing awareness and interest in the space of realtime data ingestion.

Joe Will gave a talk on Apache NiFi: Better Analytics Demands Better Dataflow. Apache Nifi is a new incubator project and was originally developed at the NSA. In short, it is a data flow management system similar to Apache Camel and Flume. It’s mostly intended for getting data from a source to a sync. It can do light weight processing such as enrichment and conversion, but not heavy duty ETL. One unique feature of Nifi is its built-in UI, which makes the management and the monitoring of the data flow convenient. The whole data flow pipeline can be drawn on a panel. The UI shows statistics such as in/out byte rates, failures and latency in each of the edges in the flow. One can pause and resume the flow in real time in the UI. Nifi’s Architecture is also a bit different from Camel and Flume. There is a master node and many slave nodes. The slaves are running the actual data flow and the master is for monitoring the slaves. Each slave has a web server, a flow controller (thread pool) layer, and a storage layer. All events are persisted to a local content repository. It also stores the lineage information in a separate governance repository, which allows it to trace at the event level. Currently, the fault tolerance story in Nifi is a bit weak. The master is a single point of failure. There is also no redundancy across the slaves. So, if a slave dies, the flow stops until the slave is brought up again.

There are lots of activities in container-based resource management.

Christos Kozyrakis gave a talk on Apache Mesos in which he outlined some of the new development in Mesos. One can now do resource allocation based on the framework roles. For example, one can configure to only give SSDs to database services. It seems that there is a Mesos DNS for service discovery. Christos also showed the results of an interesting analysis. Even with the usage of Mesos, in a data center, typical server cpu utilization is only about 30% most of the time. The reason is due to the curse of over-provisioning. Mesos is trying to improve the server utilization by reporting unused resources to the master as “best effort” resources. Those resources can then be used for low priority tasks. If not done carefully, such an approach may have negative performance impact. For example, the low priority task can pollute L3 cache and cause the latency to increase in realtime applications. Mesos tries to address this problem by using isolators at different levels.

In summary, Kafka had great presence in ApacheCon this year. It’s great to see a lot of interest in Kafka and what Confluent is building. If you like working on Kafka and are interested in helping us build realtime stream processing technologies at Confluent, please let us know.

  • Jun Rao is the co-founder of Confluent, a company that provides a stream data platform on top of Apache Kafka. Before Confluent, Jun Rao was a senior staff engineer at LinkedIn where he led the development of Kafka. Before LinkedIn, Jun Rao was a researcher at IBM's Almaden research data center, where he conducted research on database and distributed systems. Jun Rao is the PMC chair of Apache Kafka and a committer of Apache Cassandra.

Did you like this blog post? Share it now