ApacheCon 2015

ApacheCon 2015

ApacheCon 2015
Jun Rao.

I was at ApacheCon 2015 in Austin, Texas a couple of weeks ago. The following is a short summary of some of the trends that I observed at the conference.

There is a lot of usage and interest in Apache Kafka.

I gave a talk on “Introduction to Apache Kafka Tuesday afternoon. About a quarter of the attendees went to the talk and the room was full. We also hosted a Kafka meetup on Tuesday evening for conference attendees. I provided an overview of the Confluent Platform 1.0 release. In particular, I described Schema Registry and how it can be used with Kafka to captured structured feeds of stream data. Most participants seem to understand the value of having schemas and a well defined policy for schema evolution. On Wednesday morning, Todd Palino from LinkedIn gave a well received talk on Kafka at Scale: a multi-tiered architecture, in which he shared lots of valuable experience of running Kafka in production.

Tony Ng from Ebay gave a talk on Pulsar: Realtime Analytics at Scale leveraging Kafka, Hadoop and Kylin. Pulsar is an open source real time analytics and stream processing framework. Its typical usage includes reporting/dashboards, business activity/monitoring, personalization, marketing/advertising, fraud and bot detection. Currently, it’s processing about 0.5 million events/sec at Ebay. Similar to other stream processing systems, Pulsar supports getting data from different channels including Kafka, file, http, etc. It also has a pluggable processing layer. Pulsar supports SQL on stream data through Esper. One key distinguishing feature I found is that Pulsar integrates with Druid and Apache Kylin, a Hadoop based OLAP engine to build cubes.

There is a growing awareness and interest in the space of realtime data ingestion.

Joe Will gave a talk on Apache NiFi: Better Analytics Demands Better Dataflow. Apache Nifi is a new incubator project and was originally developed at the NSA. In short, it is a data flow management system similar to Apache Camel and Flume. It’s mostly intended for getting data from a source to a sync. It can do light weight processing such as enrichment and conversion, but not heavy duty ETL. One unique feature of Nifi is its built-in UI, which makes the management and the monitoring of the data flow convenient. The whole data flow pipeline can be drawn on a panel. The UI shows statistics such as in/out byte rates, failures and latency in each of the edges in the flow. One can pause and resume the flow in real time in the UI. Nifi’s Architecture is also a bit different from Camel and Flume. There is a master node and many slave nodes. The slaves are running the actual data flow and the master is for monitoring the slaves. Each slave has a web server, a flow controller (thread pool) layer, and a storage layer. All events are persisted to a local content repository. It also stores the lineage information in a separate governance repository, which allows it to trace at the event level. Currently, the fault tolerance story in Nifi is a bit weak. The master is a single point of failure. There is also no redundancy across the slaves. So, if a slave dies, the flow stops until the slave is brought up again.

There are lots of activities in container-based resource management.

Christos Kozyrakis gave a talk on Apache Mesos in which he outlined some of the new development in Mesos. One can now do resource allocation based on the framework roles. For example, one can configure to only give SSDs to database services. It seems that there is a Mesos DNS for service discovery. Christos also showed the results of an interesting analysis. Even with the usage of Mesos, in a data center, typical server cpu utilization is only about 30% most of the time. The reason is due to the curse of over-provisioning. Mesos is trying to improve the server utilization by reporting unused resources to the master as “best effort” resources. Those resources can then be used for low priority tasks. If not done carefully, such an approach may have negative performance impact. For example, the low priority task can pollute L3 cache and cause the latency to increase in realtime applications. Mesos tries to address this problem by using isolators at different levels.

In summary, Kafka had great presence in ApacheCon this year. It’s great to see a lot of interest in Kafka and what Confluent is building. If you like working on Kafka and are interested in helping us build realtime stream processing technologies at Confluent, please let us know.

Subscribe to the Confluent Blog

Email *

More Articles Like This

David Tucker

Announcing the Certified DataStax Connector for Confluent Platform

David Tucker . .

Apache CassandraTM, with its support for high volume data ingest and multi-data-center replication, is a popular and preferred landing zone for web-scale data. DataStax Enterprise (DSE), a data management layer ...

data center cloud
Jay Kreps

Sharing is Caring: Multi-tenancy in Distributed Data Systems

Jay Kreps . .

Most people think that what’s exciting about distributed systems is the ability to support really big applications. When you read a blog about a new distributed database, it usually talks ...

Matthias J Sax

Data Reprocessing with the Streams API in Kafka: Resetting a Streams Application

Matthias J Sax . .

This blog post is the third in a series about the Streams API in Kafka, the new stream processing library of the Apache Kafka project, which was introduced in Kafka ...

Leave a Reply

Your email address will not be published. Required fields are marked *

Try Confluent Platform

Download Now