Knowledge graphs are widely used to represent heterogeneous domain knowledge and make the relationships between domain entities easy to understand. Rather than in traditional SQL databases, knowledge graphs are typically stored and queried in graph databases such as Neo4j.
In this talk, we introduce a system we built on top of Apache Kafka for continuously aggregating entities and relationships harvested from different data streams. The aggregated edge stream is enriched with measures, such as entity co-occurrence, and then delivered efficiently to Neo4j using Kafka Connect.
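To illustrate the shape of such an enrichment step, the sketch below counts co-occurrences per entity pair with the Kafka Streams DSL and writes the enriched edge stream to an output topic that a Neo4j sink connector could consume. The topic names ("entity-pairs", "aggregated-edges") and the assumption that records are already keyed by a canonical entity-pair string are illustrative, not the exact setup from the talk.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;

    public class CoOccurrenceEnrichment {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Edge stream keyed by a canonical entity-pair key, e.g. "132|138" (assumed format).
            KStream<String, String> edges = builder.stream(
                "entity-pairs", Consumed.with(Serdes.String(), Serdes.String()));

            // Stateful aggregation: count how often each entity pair co-occurs.
            edges.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                 .count(Materialized.as("co-occurrence-counts"))
                 .toStream()
                 // The enriched edge stream; a Neo4j sink connector can upsert
                 // these counts as relationship properties.
                 .to("aggregated-edges", Produced.with(Serdes.String(), Serdes.Long()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "edge-aggregation");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            new KafkaStreams(builder.build(), props).start();
        }
    }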
Graph databases reach their limits when a large number of entities and relationships must be aggregated and updated. We therefore frontload the aggregation steps into stateful Kafka Streams applications. This battle-tested technology enables scalable aggregation of large data streams in near real time.
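One way to frontload the aggregation, sketched below under the same assumed topics and key format as above, is to keep the running counts in a Kafka Streams state store and forward updates downstream only at a bounded rate, so the graph database receives consolidated upserts per edge rather than one write per input event. The use of suppress() here is an illustrative rate-limiting technique, not necessarily the exact design presented in the talk.

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.Suppressed;

    public class FrontloadedAggregation {
        public static void buildTopology(StreamsBuilder builder) {
            builder.stream("entity-pairs", Consumed.with(Serdes.String(), Serdes.String()))
                   .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                   // The running count lives in a local, fault-tolerant state store.
                   .count(Materialized.as("edge-counts"))
                   // Buffer updates and emit at most one consolidated record per edge
                   // every 30 seconds (or earlier if the buffer fills up).
                   .suppress(Suppressed.untilTimeLimit(
                       Duration.ofSeconds(30),
                       Suppressed.BufferConfig.maxRecords(100_000).emitEarlyWhenFull()))
                   .toStream()
                   .to("aggregated-edges", Produced.with(Serdes.String(), Serdes.Long()));
        }
    }

Because the intermediate state is held in changelog-backed Kafka Streams state stores, the aggregation scales with the number of stream partitions and the graph database only has to absorb the consolidated output.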
This approach also allows us to implement complex measures in Kafka Streams that graph databases do not yet support, for example quantifying how regularly entities co-occur in the data streams over long periods of time.
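As a sketch of such a measure, one could count co-occurrences per entity pair in fixed time windows and then count the number of windows in which a pair was active at all. The active-window count below is only a simple proxy for regularity, again using the assumed topics and key format from above; the measure presented in the talk may be defined differently.

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.Suppressed;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class CoOccurrenceRegularity {
        public static void buildTopology(StreamsBuilder builder) {
            builder.stream("entity-pairs", Consumed.with(Serdes.String(), Serdes.String()))
                   .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                   // Count co-occurrences of each entity pair per one-hour window.
                   .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                   .count(Materialized.as("windowed-edge-counts"))
                   // Emit exactly one final count per pair and window.
                   .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
                   .toStream()
                   // Re-key by the plain entity pair and count the windows in which it
                   // appeared: a simple proxy for how regularly the pair co-occurs.
                   .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), 1L))
                   .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                   .count(Materialized.as("active-window-counts"))
                   .toStream()
                   .to("edge-regularity", Produced.with(Serdes.String(), Serdes.Long()));
        }
    }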
We will demonstrate our system with real-world examples, such as mining taxi traffic in NYC. We focus on our Kafka Streams implementation for computing complex measures and discuss the learnings that led to our design decisions.