Apache Cassandra™, with its support for high volume data ingest and multi-data-center replication, is a popular and preferred landing zone for web-scale data. DataStax Enterprise (DSE), a data management layer built on Apache Cassandra, accelerates the ability to deliver real-time value through its powerful indexing, search, analytics and graph functionality.
Apache Kafka™ is a natural fit for delivering data to a DataStax Enterprise or Apache Cassandra cluster. The enterprise-grade messaging platform simplifies data distribution pipelines and scales to support any number of data sources.
Developing a data pipeline became a lot easier with the release of Kafka Connect (see http://www.confluent.io/blog/how-to-build-a-scalable-etl-pipeline-with-kafka-connect). Recently, the team at DataMountaineer joined forces with Confluent and DataStax to develop the Certified DataStax Connector (http://docs.datamountaineer.com/en/latest/cassandra-sink.html?highlight=cassandra). This sink connector has been certified by Confluent to meet the core requirements of code quality, scalability and usability. The certification process ensures that the connector integrates properly with the Kafka Connect framework: standardizing the configuration options, abstracting message serialization, and properly supporting the optional Schema Registry. Certified connectors can also be configured from the Confluent Control Center.
Under the hood, the Certified DataStax Connector uses the highly scalable and resilient DataStax driver for Apache Cassandra.
Andrew Stevenson, CTO at DataMountaineer and former big data consultant at Barclays, says:
“In reality, big data is a big data flood. A streaming torrent of time series events from IoT, sensors, logs, financial transactions and more. Cassandra, with its high sustained write throughput, high availability and query performance, is the perfect Sink to capture these streams coming out of Kafka. The Kafka Connect DSE Sink partners with these two scalable and high performance platforms to allow automated, scalable, fault tolerant writes to Cassandra.”
Let’s take a quick look at the connector itself, and how it can be used to get the most value out of your streaming data by linking your Confluent Platform Enterprise deployment with your DataStax Enterprise or Apache Cassandra cluster.
The above graphic illustrates how the Certified DataStax Connector is deployed in a typical Confluent Platform cluster. Kafka Connect worker nodes host specialized connectors for a wide range of data sources (legacy RDBMS systems like MySQL, syslog clients, etc.). The Certified DataStax Connector is deployed to store the data produced by those sources.
All Connectors will auto-scale (within configured limits) to the available resources in the Connect Worker pool, and support dynamic reconfiguration. In practical terms, this means that the Certified DataStax Connector will deploy additional tasks should you request additional Kafka topics be stored to the DSE/Cassandra cluster or if those topics grow to a larger number of partitions. The current Connector design is conservative with respect to the Cassandra environment: each Connector instance will write to a single keyspace, mapping the monitored Kafka topics to separate tables within that keyspace. The keyspace and tables must already exist before the Connector begins sending records to the Cassandra cluster. However, the flexibility of Kafka Connect makes it trivial to deploy additional Connector instances should you want to store data to multiple keyspaces.
The basic Connector configuration parameters can be seen here (in a simple properties-file format)
The Kafka Connect framework allows those parameters to be configured at startup (via the properties file) or at run time (via the REST interface directly or the Confluent Control Center GUI).
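As a rough sketch of those two configuration paths, the Python below takes one set of connector settings and renders it both as properties-file text and as the JSON body of a POST to the Kafka Connect REST endpoint /connectors. Only export.route.query, contact.points and port are named in this post; the connect.cassandra. prefix, the connector class name, the key.space property and every value (topic, table, host names, worker URL) are assumptions for illustration, so check the connector documentation for the exact names.

```python
import json
import urllib.request

# Hypothetical settings: the property prefix, the class name and all values
# are assumptions for illustration, not copied from the connector docs.
CONNECTOR_NAME = "cassandra-sink-orders"
CONFIG = {
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
    "tasks.max": "1",
    "topics": "orders-topic",
    "connect.cassandra.export.route.query": "INSERT INTO orders SELECT * FROM orders-topic",
    "connect.cassandra.contact.points": "cassandra-1,cassandra-2",
    "connect.cassandra.port": "9042",
    "connect.cassandra.key.space": "demo",
}


def to_properties(name, config):
    """Render the settings as properties-file text for startup configuration."""
    lines = [f"name={name}"] + [f"{key}={value}" for key, value in config.items()]
    return "\n".join(lines) + "\n"


def to_rest_request(worker_url, name, config):
    """Build (but do not send) a POST /connectors request for a Connect worker."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{worker_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    print(to_properties(CONNECTOR_NAME, CONFIG))
    request = to_rest_request("http://localhost:8083", CONNECTOR_NAME, CONFIG)
    print(request.get_method(), request.full_url)
    # To submit for real: urllib.request.urlopen(request)
```

Keeping the settings in one structure means the startup file and the run-time REST call cannot drift apart; the request is only constructed here, so nothing is sent until you pass it to urlopen.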
More importantly, the framework enforces some basic message structure between the Certified DataStax Connector and the Source Connectors publishing to the Kafka cluster. That structure allows you to filter each Kafka message for the specific fields you wish to save to the Cassandra table. The Connector supports a simple query statement to define this transformation (see the export.route.query property above).
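To illustrate the effect of such a filter, here is a minimal Python sketch that keeps only chosen fields from a structured message before it would be written as a table row. The function name, the field names and the sample record are hypothetical, and this is not the Connector's implementation; it only mimics what a field-selecting query achieves.

```python
def project_fields(record, fields):
    """Return a new dict holding only the requested fields of a record.

    Raises KeyError if a requested field is absent, so that a schema
    mismatch is noticed instead of silently producing a partial row.
    """
    missing = [name for name in fields if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return {name: record[name] for name in fields}


if __name__ == "__main__":
    # Hypothetical message; "debug_trace" is dropped by the projection.
    message = {"id": 42, "product": "widget", "price": 9.99, "debug_trace": "..."}
    print(project_fields(message, ["id", "product", "price"]))
```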
An important takeaway from the Connector configuration is the cluster contact specification (contact.points and port). You’ll want to ensure proper network connectivity between ALL the Kafka Connect worker nodes and the specified DSE/Cassandra endpoints. This is because the Kafka Connect framework transparently distributes Connector tasks across all of the worker nodes, so each worker node may need a network path to the Cassandra cluster. The advantage of this distribution, however, is that the Connector tasks will automatically redeploy to an active node in the event of a node failure, ensuring maximum reliability.
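One way to verify that connectivity is to run a small TCP check on every Connect worker before deploying the Connector. The sketch below is a generic reachability test in Python, not part of the Connector; the function names are hypothetical, and it assumes the contact points are given as a comma-separated host list.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_contact_points(contact_points, port, timeout=3.0):
    """Map each host in a comma-separated contact-point list to its reachability."""
    hosts = [host.strip() for host in contact_points.split(",") if host.strip()]
    return {host: can_reach(host, port, timeout) for host in hosts}
```

For example, calling check_contact_points("cassandra-1,cassandra-2", 9042) on each worker (the host names are placeholders; 9042 is Cassandra's default native transport port) returns a per-host True/False map. A successful TCP connection shows only that a network path exists; it does not validate credentials or keyspace access.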
Once the Connector is configured and deployed, all data from the configured topics will be written to Cassandra. As more data is published to the specified topics, more records will be written to the Cassandra tables. Your DSE/Cassandra applications will have access to the most up-to-date information, allowing you to get the most business value from it.
The Kafka Connect ecosystem is growing rapidly. Connectors such as the latest Certified DataStax Connector from DataMountaineer and DataStax give users the flexibility to analyze more of their enterprise data with the latest analytic tools. The vision of a real-time enterprise, where highly scalable, always-on data management layers like DataStax Enterprise extract insights from the most up-to-date information streamed through Confluent Platform, is fast becoming the norm.